Abstract

For output-symmetric DMCs at even moderately high rates, fixed-block-length communication systems show no 
improvements in their error exponents with feedback. In this paper, we study systems with fixed end-to-end delay 
and show that feedback generally provides dramatic gains in the error exponents. 

A new upper bound (the uncertainty-focusing bound) is given on the probability of symbol error in a fixed-delay 
communication system with feedback. This bound turns out to have a similar form to Viterbi's bound used for the 
block error probability of convolutional codes as a function of the fixed constraint length. The uncertainty-focusing 
bound is shown to be asymptotically achievable with noiseless feedback for erasure channels as well as any output- 
symmetric DMC that has strictly positive zero-error capacity. Furthermore, it can be achieved in a delay-universal 
(anytime) fashion even if the feedback itself is delayed by a small amount. Finally, it is shown that for end-to-end 
delay, it is generally possible at high rates to beat the sphere-packing bound for general DMCs — thereby providing 
a counterexample to a conjecture of Pinsker. 

Index Terms 

Feedback, delay, reliability functions, anytime reliability, sphere-packing bounds, random coding, hybrid ARQ, 
queuing, list decoding.

Why block length and delay behave differently if feedback is present

I. Introduction 

The channel coding theorems studied in information theory are not just interesting as mathematical results; they
also provide insights into the underlying tradeoffs in reliable communication systems. While in practice there are 
many different parameters of interest such as power, complexity, and robustness, perhaps the most fundamental two 
are end-to-end system delay and the probability of error. Error probability is fundamental because a low probability 
of bit error lies at the heart of the digital revolution justified by the source/channel separation theorem. Delay is 
important because it is the most basic cost that must be paid in exchange for reliability — it allows the laws of 
large numbers to be harnessed to smooth out the variability introduced by random communication channels. 

In our entire discussion, the assumption is that information naturally arises as a stream generated in real time at 
the source (e.g. voice, video, or sensor measurements) and it is useful to the destination in finely grained increments 
(e.g. a few milliseconds of voice, a single video frame, etc.). The acceptable end-to-end delay is determined by 
the application and can often be much larger than the natural granularity of the information being communicated 
(e.g. voice may tolerate a delay of hundreds of milliseconds despite being useful in increments of a few milliseconds). 
This is different from cases in which information arises in large bursts with each burst needing to be received by 
the destination before the next burst even becomes available at the source. 

Rather than worrying about what the appropriate granularity of information should be, the formal problem is 
specified at the individual bit level. (See Figure 1.) If a bit is not delivered correctly by its deadline, it is considered
to be erroneous. The upper and lower bounds of this paper turn out to not depend on the choice of information 
granularity, only on the fact that the granularity is much finer than the tolerable end-to-end delay.

Fig. 1. The timeline in a rate-$\frac{1}{2}$ code with decoding delay 7. Both the encoder and decoder must be causal in that the channel inputs $X_i$
and decoded bits $\hat{B}_i$ are functions only of quantities to the left of them on the timeline. If noiseless feedback is available, the $X_i$ can also
have an explicit functional dependence on the channel outputs $Y_1^{i-1}$ that lie to the left on the timeline.

In the next section of this introduction, the example of the binary erasure channel at $R' = \frac{1}{2}$ bits per channel
use is used to constructively show how fixed-delay codes can dramatically outperform fixed-block-length codes at
the same rates when feedback is present. Existing information-theoretic views of feedback and reliability are then
reviewed in Section II. Section III states the main results of the paper, with the constructions and proofs following
in subsequent sections. Numerical examples and plots are also given in Section III to illustrate these results.

Section IV generalizes Pinsker's result from [1] for non-block-code performance with fixed delay and also explains
why, contrary to Pinsker's assertion, this argument does not generalize to the case when feedback is present. The
new upper bound (the "uncertainty-focusing bound") on fixed-delay performance is proved in Section V by reviving
Forney's inverse concatenation construction to serve this new purpose. Asymptotic achievability of this new bound
with noiseless feedback is shown in Section VI for erasure channels. These results are extended in Section VII to
general DMCs. It turns out that for channels with strictly positive feedback-zero-error capacity, a low-rate error-free
path can be constructed with very little overhead, thereby attaining the performance of the uncertainty-focusing
bound. For generic channels at high message rates, the overhead of this approach is non-negligible but the error
probability still asymptotically beats that predicted by the sphere-packing bound for the same end-to-end delay.



A. A simple example using the BEC 

The natural question of end-to-end delay in situations with finely grained information was considered by Pinsker 
in [1]. He explicitly treats the BSC case, while asserting that the results hold for any DMC. The main result 
(Theorem 5 in [1]) is that the sphere-packing bound $E_{sp}(R)$ is an upper bound to the fixed-delay error exponent
for any nonblock code. Theorem 8 in [1] asserts that the same bound continues to hold even with feedback. As
reviewed in Section II-A, these theorems parallel what is already known to hold for fixed-block-length codes.

The binary erasure channel (BEC) with erasure probability $\beta < \frac{1}{2}$ used at rate $R' = \frac{1}{2}$ bits per channel use gives
a counterexample to Pinsker's generalized conjecture. The BEC is so simple that everything can be understood with
a minimum of overhead. A counterexample that covers the BSC itself is given later in Section VII-E (plotted in
Figure 10) and others are given in [2], [3] using much more involved codes built around control-theoretic ideas.

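The sphere-packing bound in the BEC case corresponds to the probability that the channel erases more than $\frac{1}{2}$
of the inputs during the block:

$$E_{sp}^{bec}\left(\tfrac{1}{2}\right) = D\left(\tfrac{1}{2} \,\Big\|\, \beta\right) = -\frac{\ln(4\beta(1-\beta))}{2}. \tag{1}$$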

For $\beta = 0.4$, this yields an error exponent of about 0.02. Even with feedback, there is no way for a fixed-block-length
code to beat this exponent. If the channel lets fewer than $\frac{n}{2}$ bits through, it is impossible to reliably communicate
an $\frac{n}{2}$-bit message! Bit-error vs block-error considerations alone do not change the overall picture since they buy at
most a factor of $\frac{1}{n}$ in the average probability of error, which is nothing on an exponential scale.

With noiseless feedback, the natural nonblock code just retransmits a bit over the BEC until it is correctly 
received. To be precise, as bits arrive steadily at the rate $R' = \frac{1}{2}$ bits per channel use, they enter a FIFO queue of
bits awaiting transmission. At time 0, both the encoder and decoder know that there are no bits waiting. From that 
time onward, the bit arrivals are modeled here as deterministic and come every other channel use. Since both the 
encoder and decoder know when a bit arrives as well as when a bit is successfully received, there is no ambiguity 
in how to interpret a channel output. 

If the queue length is examined every two channel uses, exactly one new bit has arrived while the channel may 
have successfully served 0, 1, or 2 bits in this period. Thus, the length of the queue can either increase by one, 
stay the same, or decrease by one. The queue length can be modeled (see Figure 2) as a birth-death Markov chain
with a $\beta^2$ probability of birth and a $(1-\beta)^2$ probability of death. The steady state distribution of the queue length
is therefore $\pi_i = \kappa \left(\frac{\beta}{1-\beta}\right)^{2i}$ where $\kappa$ is the normalization constant $1 - \left(\frac{\beta}{1-\beta}\right)^{2}$.

Fig. 2. The birth-death Markov chain governing the rate-$\frac{1}{2}$ communication system over an erasure channel with feedback. The scheme
merely retransmits bits until successful reception.
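
The geometric form of the steady state distribution can be checked numerically. The following is a minimal sketch, assuming the chain is observed every two channel uses with birth probability $\beta^2$ and death probability $(1-\beta)^2$, and that an empty queue simply cannot shrink; the function names and the truncation to a finite number of states are illustrative choices, not part of the scheme itself.

```python
def geometric_stationary(beta, n_states):
    """Candidate steady state pi_i = kappa * r**(2*i), with r = beta / (1 - beta)."""
    r2 = (beta / (1.0 - beta)) ** 2
    kappa = 1.0 - r2  # normalization constant
    return [kappa * r2 ** i for i in range(n_states)]


def one_step(pi, beta):
    """Apply one transition of the birth-death chain to the distribution pi."""
    birth = beta ** 2          # both channel uses erased: queue grows by one
    death = (1.0 - beta) ** 2  # both channel uses succeed: queue shrinks by one
    n = len(pi)
    out = [0.0] * n
    for i, mass in enumerate(pi):
        down = death if i > 0 else 0.0  # an empty queue cannot shrink
        if i + 1 < n:
            out[i + 1] += mass * birth
        if i > 0:
            out[i - 1] += mass * down
        out[i] += mass * (1.0 - birth - down)
    return out


pi = geometric_stationary(0.4, 50)
nxt = one_step(pi, 0.4)
# Ignore the last state, which is distorted by the truncation.
drift = max(abs(a - b) for a, b in zip(pi[:-1], nxt[:-1]))
print("largest change after one step:", drift)
print("P(queue >= 6):", sum(pi[6:]), "vs (beta/(1-beta))**12 =", (0.4 / 0.6) ** 12)
```

The distribution is left unchanged by a transition (up to floating-point error), and its tail at queue length 6 matches $(\beta/(1-\beta))^{12}$, which is the quantity that matters for a deadline of 12 channel uses.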

To understand the probability of error with end-to-end delay, just notice that the only way a bit can miss its 
deadline is if it is still waiting in the queue. If it was a bit from $d$ time steps ago, the queue must currently hold
at least $\frac{d}{2}$ bits. The steady state distribution reveals that the asymptotic probability of this is:

$$\sum_{i \ge d/2} \pi_i = \left(\frac{\beta}{1-\beta}\right)^{d}.$$

Converting that into an error exponent with delay $d$ gives

$$E^{bec}\left(\tfrac{1}{2}\right) = \ln(1-\beta) - \ln(\beta). \tag{2}$$

Plugging in $\beta = 0.4$ reveals an exponent of more than 0.40. This is about twenty times higher than the sphere-packing
bound! Simple computations can verify that the ratio of (2) to (1) goes to infinity as $\beta \to \frac{1}{2}$.
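
The two exponents are easy to tabulate. The sketch below assumes the closed forms $-\frac{1}{2}\ln(4\beta(1-\beta))$ for the fixed-block-length (sphere-packing) exponent at rate $\frac{1}{2}$ and $\ln(1-\beta) - \ln(\beta)$ for the fixed-delay exponent of the retransmission scheme; the function names are illustrative.

```python
import math


def sphere_packing_bec_half(beta):
    """Block-coding exponent of the BEC at rate 1/2: the divergence D(1/2 || beta)."""
    return -0.5 * math.log(4.0 * beta * (1.0 - beta))


def queue_exponent_bec_half(beta):
    """Fixed-delay exponent of the retransmission scheme at rate 1/2."""
    return math.log(1.0 - beta) - math.log(beta)


for beta in (0.30, 0.40, 0.45, 0.49):
    block = sphere_packing_bec_half(beta)
    delay = queue_exponent_bec_half(beta)
    print(f"beta={beta:.2f}  block={block:.4f}  delay={delay:.4f}  ratio={delay / block:.1f}")
```

Both exponents vanish as $\beta$ approaches $\frac{1}{2}$ (where the capacity drops to the rate), but the block-coding exponent vanishes quadratically in the gap while the fixed-delay exponent vanishes only linearly, which is why the ratio grows without bound.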

To help get an intuitive idea for why this happens, it is worthwhile to consider an idealized feedback-free code for
erasure channels (the reader may find it helpful to think of packet erasure channels with large alphabets). Suppose
that the encoder causally generated "parities" of all the message symbols so far with the property that symbols
could be decoded whenever the receiver had as many unerased parities as there were undecoded symbols.¹ The
queue size can be reinterpreted in this setting as the number of additional parities required before the decoder could 
solve for the currently uncertain message symbols. The queue's renewal times correspond to the times at which 
the decoder can solve for the current set of undecoded message symbols. 



Fig. 3. A simulated run of a channel with erasure probability 0.4 using an idealized linear causal code without feedback, plotting the number
of undecoded bits against time (in BEC uses). The red upper sawtooth represents the number of current message symbols that are still
ambiguous at the decoder while the lower curve represents the number of additional parities that would enable it to resolve the current
ambiguity. Not coincidentally, the lower curve is also the queue size for the natural FIFO-based code with feedback. The dotted line at 6
represents a potential delay deadline of 12 time units.

Figure 3 illustrates the backlog of undecoded bits in a simulated run of a rate-$\frac{1}{2}$ code over a channel with erasure
probability 0.4. Figure 4 zooms in on a particular segment of time corresponding to an "error event" and shows the
differences between how the feedback-free code and feedback code make progress. During an error event in which 
the channel is erasing too many symbols, progress at the decoder seems to stop entirely in the feedback-free code, 
only catching up in a sudden burst when the error event ends. By contrast, the code with feedback makes visible, 
but slower, progress at the decoder even during these error events. As a result, it is able to meet the target delay 
deadline whereas the code without feedback misses it. This example also shows how the delays in the feedback-free 
code are related to the inter-renewal times of the queue, while the delays in the code with feedback are related to 
the length of the queue itself. 
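
A simulation along these lines is short to write for the code with feedback. The sketch below is a minimal Monte Carlo model, assuming one bit arrives just before every even-numbered channel use and that a bit's delay is counted up to and including the channel use on which it gets through; the function names, seed, and run length are arbitrary illustrative choices.

```python
import random
from collections import deque


def simulate_fifo_delays(beta, num_uses, seed=0):
    """Retransmit-until-received over a BEC with noiseless feedback at rate 1/2."""
    rng = random.Random(seed)
    waiting = deque()  # arrival times of the bits still in the FIFO queue
    delays = []
    for t in range(num_uses):
        if t % 2 == 0:
            waiting.append(t)  # a new bit arrives every other channel use
        if waiting and rng.random() >= beta:  # channel use t is not erased
            delays.append(t + 1 - waiting.popleft())
    return delays


def miss_fraction(delays, deadline):
    """Fraction of delivered bits whose delay exceeded the deadline."""
    return sum(1 for d in delays if d > deadline) / len(delays)


delays = simulate_fifo_delays(beta=0.4, num_uses=200_000, seed=1)
for deadline in (4, 8, 12, 16):
    print(deadline, miss_fraction(delays, deadline), (0.4 / 0.6) ** deadline)
```

The empirical miss fractions should fall off with the deadline at roughly the geometric rate $(\beta/(1-\beta))^d$ up to a constant factor; a finite run only probes the first few decades of that decay.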

Stepping back, this example illustrates that Pinsker's bound with delay does not generally apply when feedback is 
available. Instead, fixed-delay nonblock codes can dramatically outperform fixed-block-length codes with feedback. 
Moreover, it is possible to glimpse why this occurs. Reliable communication always takes place at message rates 
R that are less than the capacity C. In a fixed-delay setting with feedback, the encoder has the flexibility to do 
flow control based on what the channel has been doing in the past. It can vary the short-term operational rate 
R — in effect stealing channel uses from later bits to make sure that earlier bits meet their looming deadlines, 
while still hoping that the later bits will be able to meet their later deadlines. This flexibility is missing in the 
fixed-block-length setting because all the bits in the block are forced to share a common deadline. 

This can also be seen by contrasting the total conditional entropy $H(B_1^{(t+d)R'} \mid Y_1^{t+d})$ of the message bits $B_1^{(t+d)R'}$
given the channel outputs $Y_1^{t+d}$ to the sum $\sum_{k=1}^{(t+d)R'} H(B_k \mid Y_1^{t+d})$ of the marginal conditional entropies of the bits

1 This is in the style of rateless block coding [4], except that the message bits are revealed to the encoder over time rather than being known
all at the beginning.

Fig. 4. A zoomed-in look at the simulation of Figure 3 showing the total number of decoded symbols as a function of time. The thin
upper curve is the total number of symbols that have been received at the rate-$\frac{1}{2}$ encoder. The next lower line is the total number of symbols
decoded by the code with feedback. The lowest curve corresponds to the code without feedback. The thin dotted line represents the deadline
of 12 time steps. Whenever the decoder curves are below this curve, they are missing the deadline.

given the channel outputs. If the channel misbehaves slightly and makes it hard to distinguish only a single pair
of bit strings, the marginal entropies $H(B_k \mid Y_1^{t+d})$ can become large even as the total conditional entropy is small.
Such situations are common without feedback. From the decoder's perspective, the feedback encoder's strategy
should be to focus the uncertainty $H(B_1^{(t+d)R'} \mid Y_1^{t+d})$ onto later bits $B_k$ with $k > tR'$ to pay for reducing it on earlier
bits. The sum of the marginal conditional entropies can then be made the same as the total conditional entropy.

The total delay experienced by a bit can also be broken into two components: queuing delay and transmission 
delay. For the erasure channel, the transmission delay is just a geometric random variable governed by an exponent
of $-\ln(\beta)$. This transmission exponent does not change with the message rate. The queuing delay is the dominant
term, and its exponent does change with the message rate. 

Finally, it is interesting to examine the computational burden of implementing this simple code. At the encoder, 
all that is needed is a FIFO queue that costs a constant (assuming memory is free) per unit time to operate. The 
decoder has similar complexity since it too just tracks how many bits it has received so far in comparison with the 
number of bits known to have arrived at the encoder. The computational burden does not change with either the 
target delay or the quality of the channel! 
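
The constant per-step cost is visible in a direct implementation. The following is a minimal sketch, assuming deterministic arrivals on even-numbered channel uses, a dummy input of 0 when the queue is empty, and `None` standing in for the erasure symbol; the class and function names are illustrative and not taken from the paper.

```python
from collections import deque

ERASURE = None  # stand-in for the erasure symbol


class FifoEncoder:
    def __init__(self):
        self.queue = deque()

    def arrive(self, bit):
        self.queue.append(bit)

    def channel_input(self):
        # Send the oldest waiting bit, or a dummy 0 when idle.
        return self.queue[0] if self.queue else 0

    def feedback(self, output):
        # Noiseless feedback: drop the head bit once it gets through.
        if output is not ERASURE and self.queue:
            self.queue.popleft()


class FifoDecoder:
    def __init__(self):
        self.arrived = 0   # bits known to have arrived at the encoder
        self.decoded = []

    def note_arrival(self):
        self.arrived += 1

    def receive(self, output):
        # An unerased output is a message bit only if the queue was nonempty.
        if output is not ERASURE and len(self.decoded) < self.arrived:
            self.decoded.append(output)


def run(bits, erasures):
    enc, dec = FifoEncoder(), FifoDecoder()
    source = iter(bits)
    for t, erased in enumerate(erasures):
        if t % 2 == 0:
            bit = next(source, None)
            if bit is not None:
                enc.arrive(bit)
                dec.note_arrival()  # the arrival schedule is known to both ends
        x = enc.channel_input()
        y = ERASURE if erased else x
        dec.receive(y)
        enc.feedback(y)
    return dec.decoded


print(run([1, 0, 1, 1], [True, False, True, True, False, False, False, False]))
```

Each channel use costs one queue operation at the encoder and one counter comparison at the decoder, independent of the deadline and of $\beta$. The decoder never needs to see the queue itself: the count of arrivals minus the count of decoded bits tells it whether an unerased output carries a message bit.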



II. Background 

A. Fixed-length codes 

Traditionally, reliable communication was first explored in the context of block codes [5]. If physical information 
sources are considered to produce bits steadily at R' bits per second, then the use of a block code of length n 
channel uses (with channel uses assumed to occur once per second) contributes to end-to-end delay in two ways. 

• Enough bits must first be buffered up to even compute the codeword. This takes no more than n seconds and 
can take less if the block code is systematic in nature. 

• The decoder must wait for n seconds to get the n channel outputs needed to decode the block. This second 
delay would be present even if the source bits were realized entirely in advance of the use of the channel. 

In this context, the fundamental lower bound on error probability comes from the sphere-packing bound. To 
understand this bound, it is helpful to think about the message block as representing a certain volume of entropic 
uncertainty that the decoder has about the message. The objective of using the channel is to reduce this uncertainty. 
Let $P$ be the transition matrix ($p_{y|x}$ is the probability of seeing output $y$ given input $x$) for the DMC. Each channel
use can reduce the uncertainty on average by no more than the capacity

$$C(P) = \max_{\vec{q}} I(\vec{q}, P) \tag{3}$$

where $I(\vec{q}, P)$ is the mutual information between input and output of channel $P$ when $\vec{q}$ is the input distribution
and is defined by

$$I(\vec{q}, P) = \sum_{x} q_x \sum_{y} p_{y|x} \ln \frac{p_{y|x}}{\sum_{k} q_k p_{y|k}}. \tag{4}$$
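
As a concrete check of the mutual information and capacity definitions, the sketch below evaluates $I(\vec{q}, P)$ in nats and approximates the capacity of a binary-input channel by a grid search over the input distribution. The grid search and the function names are illustrative choices; the channel matrices are stored as `P[x][y]` $= p_{y|x}$.

```python
import math


def mutual_information(q, P):
    """I(q, P) in nats, where P[x][y] is the probability of output y given input x."""
    num_outputs = len(P[0])
    out = [sum(q[x] * P[x][y] for x in range(len(q))) for y in range(num_outputs)]
    total = 0.0
    for x, qx in enumerate(q):
        for y, pyx in enumerate(P[x]):
            if qx > 0.0 and pyx > 0.0:  # terms with zero probability contribute nothing
                total += qx * pyx * math.log(pyx / out[y])
    return total


def capacity_binary_input(P, steps=1000):
    """Grid search over q = (1 - q1, q1); adequate for a two-input channel."""
    return max(mutual_information([1.0 - k / steps, k / steps], P) for k in range(steps + 1))


z_channel = [[1.0, 0.0], [0.5, 0.5]]       # input 1 is nulled to 0 with probability 0.5
bec = [[0.6, 0.4, 0.0], [0.0, 0.4, 0.6]]   # outputs ordered (0, erasure, 1), beta = 0.4
print("Z-channel capacity (nats):", capacity_binary_input(z_channel))
print("BEC capacity (nats):", capacity_binary_input(bec))
```

For the Z-channel with nulling probability 0.5 this gives about 0.223 nats, attained at an input distribution that puts weight 0.4 rather than 0.5 on the noisy input.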

With or without feedback, successful communication is not possible if during the block, the memoryless channel
acts like one whose capacity is less than the target message rate. Following [6], [7], for fixed-block-length codes
this idea immediately gives the following upper bound (referred to as the Haroutunian bound throughout this paper)
on the block-coding error exponent ($\limsup_{n \to \infty} -\frac{1}{n} \ln P_e$ for block error probability $P_e$):

$$E^{+}(R) = \inf_{G : C(G) < R} \; \sup_{\vec{r}} D(G \| P \mid \vec{r}) \tag{5}$$

$$= \inf_{G : C(G) < R} \; \max_{x} \sum_{y} g_{y|x} \ln \frac{g_{y|x}}{p_{y|x}} \tag{6}$$

where $D(G \| P \mid \vec{r})$ is the divergence term that governs the exponentially small probability of the true channel $P$
behaving like channel $G$ when facing the input distribution $\vec{r}$. The divergence is defined as

$$D(G \| P \mid \vec{r}) = \sum_{x} r_x \sum_{y} g_{y|x} \ln \frac{g_{y|x}}{p_{y|x}}. \tag{7}$$

Without feedback, the encoder does not have the flexibility to change the input distribution in response to the
channel's behavior. The optimization can take this into account to get the bound traditionally known as the sphere-packing
bound

$$E_{sp}(R) = \max_{\vec{r}} \; \min_{G : I(\vec{r}, G) \le R} D(G \| P \mid \vec{r}). \tag{8}$$

It is clear that $E_{sp}(R) \le E^{+}(R)$ and Figure 5 illustrates that the inequality can be strict.



Fig. 5. The sphere-packing and Haroutunian bounds for the Z-channel with nulling probability 0.5. The upper curve is the Haroutunian upper
bound for the error exponent of block codes with feedback and the lower curve is the classical sphere-packing bound. Both approach zero
very rapidly around the capacity of 0.223 nats per channel use. Due to the asymmetry of the Z-channel, the capacity-achieving distribution
is not the same as the sphere-packing-bound-achieving distribution.

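It is often useful to use an alternate form for $E_{sp}(R)$ given by [8]

$$E_{sp}(R) = \max_{\rho \ge 0} \left[ E_0(\rho) - \rho R \right] \tag{9}$$

with the Gallager function $E_0(\rho)$ defined as:

$$E_0(\rho) = \max_{\vec{q}} E_0(\rho, \vec{q}), \qquad E_0(\rho, \vec{q}) = -\ln \sum_{y} \left( \sum_{x} q_x \, p_{y|x}^{\frac{1}{1+\rho}} \right)^{1+\rho}. \tag{10}$$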



Since the random-coding error exponent is given by

$$E_r(R) = \max_{0 \le \rho \le 1} \left[ E_0(\rho) - \rho R \right], \tag{11}$$

it is clear that the sphere-packing bound is achievable, even without feedback, at message rates close to $C$ since
for those rates, $\rho < 1$ optimizes both expressions [8].
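
The parametric forms (9) and (11) are straightforward to evaluate numerically. The sketch below is a minimal illustration, assuming a uniform input distribution (which is optimal for output-symmetric channels such as the BEC) and a simple grid search over $\rho$; the function names are illustrative. It recovers the sphere-packing exponent of about 0.02 quoted for the BEC with erasure probability 0.4 at rate $\frac{1}{2}$ bit per channel use.

```python
import math


def gallager_e0(rho, q, P):
    """E_0(rho, q) in nats for a channel with P[x][y] = p_{y|x}."""
    s = 1.0 / (1.0 + rho)
    total = 0.0
    for y in range(len(P[0])):
        inner = sum(q[x] * P[x][y] ** s for x in range(len(q)) if P[x][y] > 0.0)
        total += inner ** (1.0 + rho)
    return -math.log(total)


def exponent(rate_nats, q, P, rho_max, step=1e-3):
    """max over rho in [0, rho_max] of E_0(rho, q) - rho * R, by grid search."""
    n = int(round(rho_max / step))
    return max(gallager_e0(k * step, q, P) - k * step * rate_nats for k in range(n + 1))


beta = 0.4
bec = [[1.0 - beta, beta, 0.0], [0.0, beta, 1.0 - beta]]  # outputs (0, erasure, 1)
uniform = [0.5, 0.5]
rate = 0.5 * math.log(2)  # 1/2 bit per channel use, expressed in nats

e_sp = exponent(rate, uniform, bec, rho_max=5.0)  # sphere-packing form
e_r = exponent(rate, uniform, bec, rho_max=1.0)   # random-coding form
print("E_sp:", e_sp, " E_r:", e_r, " closed form:", -0.5 * math.log(4 * beta * (1 - beta)))
```

At this rate the maximizing $\rho$ is below 1, so the two forms agree, which is the high-rate behavior described above.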

It is less well appreciated that the points on the sphere-packing bound where $\rho > 1$ are also achievable by random
coding if the sense of "correct decoding" is relaxed. Rather than forcing the decoder to emit a single estimated
codeword, list decoding allows the decoder to emit a small list of guessed codewords. The decoding is considered
correct if the true codeword is on the list. For list decoding with list size $\ell$ in the context of random codes, Problem
5.20 in [8] reveals that

$$E_{r,\ell}(R) = \max_{0 \le \rho \le \ell} \left[ E_0(\rho) - \rho R \right] \tag{12}$$

is achievable. At high message rates (where the maximizing $\rho$ is small), there is no benefit from relaxing to list
decoding, but it makes a difference at low rates.
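
The effect of the list size can be seen by widening the range of $\rho$ in the maximization. The sketch below assumes a BSC with crossover probability 0.1 (a hypothetical choice made only for illustration), uniform inputs, and a grid search over $\rho \in [0, \ell]$; the function names are illustrative.

```python
import math


def gallager_e0(rho, q, P):
    """E_0(rho, q) in nats for a channel with P[x][y] = p_{y|x}."""
    s = 1.0 / (1.0 + rho)
    total = 0.0
    for y in range(len(P[0])):
        inner = sum(q[x] * P[x][y] ** s for x in range(len(q)) if P[x][y] > 0.0)
        total += inner ** (1.0 + rho)
    return -math.log(total)


def list_exponent(rate_nats, list_size, q, P, step=1e-3):
    """Random-coding exponent with list decoding: rho ranges over [0, list_size]."""
    n = int(round(list_size / step))
    return max(gallager_e0(k * step, q, P) - k * step * rate_nats for k in range(n + 1))


bsc = [[0.9, 0.1], [0.1, 0.9]]  # assumed crossover probability 0.1
uniform = [0.5, 0.5]
for rate in (0.30, 0.05):  # nats per channel use; capacity is about 0.368
    row = [list_exponent(rate, size, uniform, bsc) for size in (1, 2, 4, 8)]
    print(f"R={rate:.2f}:", ["%.4f" % e for e in row])
```

At the higher rate all four list sizes give the same exponent, while at the lower rate the exponent increases with the list size until the unconstrained maximizer of (9) falls inside $[0, \ell]$.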

Fig. 6. The sphere-packing bound divided up into two sections: a blue segment where list decoding is needed for random codebooks to
achieve it, and a red segment where lists are not needed. The tangents represent list sizes of 8, 4, 2, and 1.



Figure 6 illustrates the range of exponents for which list decoding is required for a BSC. The blue part of the
sphere-packing curve shows where list decoding is important and the red part shows where lists are not required.
Four tangents are illustrated corresponding to list sizes of 8, 4, 2, and 1. The y-intercepts of these tangents represent 
the maximum error exponents possible using those list sizes and random codes. 

For output-symmetric channels (see Definition 3.1), it is clear that $E^{+}(R) = E_{sp}(R)$ since the input distribution
$\vec{r}$ can always be chosen to be uniform [9]. Thus, for fixed-block-length codes and output-symmetric DMCs, not
only does causal feedback not improve capacity, it does not improve reliability either, at least at high rates.²

The extreme limit of reliability in the fixed-block-length setting is given by the study of zero-error capacity, in 
which the probability of decoding error is required to be exactly zero. As pointed out in [13], this can be different 
with and without feedback. For zero-error capacity, the details of the channel matrix P are not important as it 
clearly only depends on which entries are zero. The true zero-error capacity without feedback $C_0$ is very hard to
evaluate, but the zero-error capacity with feedback $C_{0,f}$ can be easily evaluated when it is greater than zero [14].

Although there is an explicit expression for $C_{0,f}$ in [13], the interpretation is more straightforward in the context
of (9). The relation

$$C_{0,f} = \lim_{\rho \to \infty} \frac{E_0(\rho)}{\rho} \tag{13}$$

was established in [15] by evaluating the limit and showing that it is identical to the expression for $C_{0,f}$ from
[13]. If $C_{0,f}$ is nonzero, both the sphere-packing bound (9) and Haroutunian bound (5) are infinite at message rates
below $C_{0,f}$ and finite above it.

B. Variable-length codes 

Since feedback neither improves the capacity nor significantly improves the fixed-block-length reliability function, 
it seemed that this particular reliability somehow represented the wrong technical question to ask. After all, it was 
unable to answer why feedback seemed to be so useful in practice. The traditional response to this was to fall back 
to the issue of complexity. 

Because classical decoding of fixed-block-length codes has a complexity that is not linear in the block length, 
the block length was viewed as a proxy for implementation complexity rather than only for end-to-end delay. Just 

2 Notice how the situation for unconstrained DMCs is dramatically different from the behavior of the AWGN channel with noiseless 
feedback for which Schalkwijk and Kailath showed double-exponential reliability with block length [10], [11]. However, those results rely 
crucially on the variable nature of an input constraint that only has to hold on average. An unconstrained DMC is more like an AWGN 
channel with just a hard amplitude constraint on the channel inputs [12].

as in variable-length source-coding, the idea in variable-block-length channel-coding is to extend use of the channel
when the channel is behaving atypically. This way, the presumed complexity of increased block lengths is only 
experienced rarely and on average, the system can be simpler to operate. 

Without feedback, a variable-length mode of operation is impossible since the encoder has no way to know if 
the channel is behaving typically or atypically. With noiseless feedback, the length of the codeword can be made to 
vary based on what the channel has done so far — as long as this variation depends only on the received channel 
symbols. This is the counterpart to the unique decodability requirement in source coding in that both are needed 
to prevent an irrecoverable loss of synchronization between the encoder and decoder. 

One proposed error exponent for variable-length channel codes divides the negative log of the probability of
block error $\epsilon$ by the expected block length $E[N_\epsilon]$ of an average rate-$\bar{R}$ variable-length code [16]:

$$E_v(\bar{R}) = \limsup_{\epsilon \to 0} \frac{-\ln \epsilon}{E[N_\epsilon]}.$$

Burnashev gave an upper bound to this exponent by using martingale arguments treating the ending of a block
as a stopping time and studying the rate of decrease in the conditional entropy of the message at the receiver [16].
This gives

$$E_v(\bar{R}) = C_1 \left( 1 - \frac{\bar{R}}{C} \right) \tag{14}$$

where $C$ is the Shannon capacity of the channel and

$$C_1 = \max_{x, x'} D\left( P(\cdot|x) \,\|\, P(\cdot|x') \right) = \max_{x, x'} \sum_{y} p_{y|x} \ln \frac{p_{y|x}}{p_{y|x'}} \tag{15}$$

represents the maximum divergence possible between channel output distributions given choice of two input letters. 
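
To make the two constants concrete, the sketch below computes $C_1$ by brute force over pairs of input letters and evaluates the straight line $C_1(1 - \bar{R}/C)$. The BSC with crossover probability 0.1 and the BEC with erasure probability 0.4 are assumed example channels, and the function names are illustrative.

```python
import math


def max_divergence(P):
    """C_1: largest divergence between output distributions of two input letters."""
    best = 0.0
    for x, row_x in enumerate(P):
        for xp, row_xp in enumerate(P):
            if x == xp:
                continue
            d = 0.0
            for pyx, pyxp in zip(row_x, row_xp):
                if pyx > 0.0:
                    if pyxp == 0.0:
                        d = math.inf  # an output possible under x but not under x'
                        break
                    d += pyx * math.log(pyx / pyxp)
            best = max(best, d)
    return best


def burnashev_exponent(rate, capacity, c1):
    """Straight-line exponent C_1 * (1 - R / C) for 0 <= R <= C (nats)."""
    return c1 * (1.0 - rate / capacity)


p = 0.1
bsc = [[1 - p, p], [p, 1 - p]]
bsc_capacity = math.log(2) + p * math.log(p) + (1 - p) * math.log(1 - p)
c1 = max_divergence(bsc)
print("BSC C_1:", c1, " exponent at R = C/2:", burnashev_exponent(bsc_capacity / 2, bsc_capacity, c1))
print("BEC C_1:", max_divergence([[0.6, 0.4, 0.0], [0.0, 0.4, 0.6]]))
```

For the BSC the constant is finite, while for the BEC it is infinite because some outputs are impossible under one of the inputs.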

While Burnashev gives an explicit variable-length scheme in [16] that asymptotically attains the exponent of
(14), the scheme of Yamamoto and Itoh in [17] is simpler and makes clear the idea of separating reliability from
efficiency. Suppose there is a single message of $nR$ nats to send:

1) Transmit the message using any reliable block code at a rate $\tilde{R} < C$ close to capacity but larger than the
target average rate $R$. This will consume $n\frac{R}{\tilde{R}}$ channel uses.

2) Use the noiseless feedback to decide at the encoder whether the message was received correctly or incorrectly.

3) If the message was received correctly, send a "confirm" signal by sending input $x$ from (15) repeated $n(1 - \frac{R}{\tilde{R}})$
times. Otherwise, use the channel to send a "deny" signal by repeating input $x'$ the same number of times.
This part can be interpreted as a sort of punctuation: a "deny" is a backspace telling the decoder to erase
what it has seen so far while a "confirm" is a comma telling the decoder that this block is finished.

4) The decoder performs a simple binary hypothesis test on the received confirm/deny channel outputs to decide
whether to accept the current message block. If it rejects the block, then the encoder will retransmit it until
it is accepted. Since errors only occur when the message is falsely accepted, the decoder minimizes the
probability of false alarm while holding the probability of missed detection to some acceptably low level.

Since retransmissions can be made as rare as desired as long as $\tilde{R} < C$, the overall average rate $\bar{R}$ of the scheme
approaches $R$. Since the number of slots for the "confirm/deny" message can be made to approach $n(1 - \frac{R}{C})$,
the reliability approaches (14) by Stein's Lemma [18]. Our approach to generic channels in Section VII-E can
be considered as using variable-block-length codes to achieve good fixed-delay performance by combining an
alternative approach to punctuation with a softer sense of retransmission.

The Burnashev exponent is dramatically higher than the fixed-block-length exponents (see Figure 8) and thus
seems to demonstrate the advantage of feedback. However, it is unclear what the significance of average delay or 
block length really is in a system. The block length under the Yamamoto and Itoh scheme is distributed like a 
scaled geometric random variable. Consequently, the block length will exceed a target deadline (like an underlying 
channel's coherence time or an application-specific latency requirement) far more often than the scheme makes an 
undetected error. There are also no known nontrivial separation theorems involving either average block length or 
average delay.

C. Nonblock codes

Another classical approach to the problem of reliable communication is to consider codes without any block 
structure. Convolutional and tree codes represent the prototypical examples. It was realized early on that in an 
infinite-constraint-length convolutional code under ML decoding, all bits will eventually be decoded correctly [8]. 
Given that this asymptotic probability of error is zero, there are two possible ways to try to understand the underlying 
tradeoffs: look at complexity or look at the delay. 

The traditional approach was to focus on complexity by examining the case of finite constraint lengths. The per-symbol
encoding complexity of a convolutional code is linear in the constraint length, and if sequential decoding
algorithms are used and the message rate is below the cutoff rate $E_0(1)$, so is the average decoding complexity
[19]. With a fixed constraint length $\nu$, the probability of error cannot go to zero and so it is natural to consider
the tradeoff between the error probability and constraint length $\nu$. Viterbi used a genie-aided argument to map
the sphere-packing bound for block codes into an upper bound for fixed-constraint-length convolutional codes. (A
variant of this argument is used in Section V to bound performance with delay.) This gives the following parametric
upper bound for the exponent governing how fast the bit error probability can improve with the constraint length [20]:

$$E_c(R) = E_0(\rho), \qquad R = \frac{E_0(\rho)}{\rho} \tag{16}$$

where $\rho > 0$. The "inverse concatenation construction" (illustrated in Figure 7) is the graphical representation of
the above curve: it is the envelope of the $(R, E)$ intercepts traced out by the tangents to the sphere-packing
bound. Thus, this upper bound can be tightened in the low-rate regime by using the "straight-line bound" from
[21]. The bound (16) is also achievable in the high-rate regime ($R > E_0(1)$) [19].
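
The parametric curve can be traced numerically by solving $R = E_0(\rho)/\rho$ for $\rho$. The sketch below does this by bisection for a BEC with uniform inputs, for which the Gallager function has the closed form $E_0(\rho) = -\ln(\beta + (1-\beta)2^{-\rho})$; the choice of channel, the bisection bracket, and the function names are assumptions made for illustration.

```python
import math


def e0_bec(rho, beta):
    """Gallager function of a BEC with uniform inputs (nats)."""
    return -math.log(beta + (1.0 - beta) * 2.0 ** (-rho))


def parametric_exponent(rate_nats, beta, lo=1e-9, hi=200.0, iters=200):
    """Solve R = E_0(rho)/rho by bisection and return (rho, E_0(rho)).

    E_0(rho)/rho decreases from the capacity (rho -> 0) toward 0, so a
    solution exists whenever 0 < R < capacity = (1 - beta) * ln 2.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if e0_bec(mid, beta) / mid > rate_nats:
            lo = mid
        else:
            hi = mid
    rho = 0.5 * (lo + hi)
    return rho, e0_bec(rho, beta)


beta = 0.4
rho, e_c = parametric_exponent(0.5 * math.log(2), beta)
print("rho:", rho, " exponent:", e_c, " ln(1-beta) - ln(beta):", math.log((1 - beta) / beta))
```

For this channel at $\frac{1}{2}$ bit per channel use the parametric form evaluates to about 0.405 with $\rho \approx 1.17$, numerically the same value as the exponent $\ln(1-\beta) - \ln(\beta)$ obtained for the retransmission scheme in the BEC example of Section I-A.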

The $E_c(R)$ from (16) for fixed constraint lengths is substantially higher than $E_{sp}(R)$ from (9) for fixed block
lengths. This was used to argue for the superiority of convolutional codes over block codes from an implementation
point of view. However, it is important to remember that this favorable comparison does not hold when end-to-end 
delay, rather than complexity, is considered. 

If the end-to-end delay is forced to be bounded, then the bit-error probability with delay is governed by $E_r(R)$
for random convolutional codes, even when the constraint lengths are unbounded [22]. This performance with delay
is also achievable using an appropriately biased sequential decoder [23]. A nice feature of sequential decoders is
that they are not tuned to any target delay: they can be prompted for estimates at any time and they will give
the best estimate that they have. Thus an infinite-constraint-length convolutional code with appropriate sequential
decoding achieves the exponent $E_r(R)$ delay universally over all (sufficiently long) delays. This property turns out
to be important for this paper since such codes are used in place of two-point block codes to encode punctuation
information in Section VII-E.

The role of feedback in nonblock codes has also been investigated considerably by considering a variety of 
different schemes [24], [25], [26], [27], [28], [29], [30], each with an idiosyncratic way of defining a relevant error 
exponent. The simplest approach is to consider a variable-constraint-length model in which complexity is counted 
by the expected number of multiply-accumulate operations that are required to encode a new channel symbol. This 
is done in Appendix I. The result is that for all rates below the computational cutoff rate, a finite amount of expected
computation per input bit is enough to get an arbitrarily low probability of error; that is, the computational error
exponent is infinite.

At first glance, this infinite exponent seems to show the superiority of variable-constraint-length codes over 
variable-block-length codes with feedback. After all, the Burnashev bound (14) is only infinite for channels whose
probability matrix $P$ contains a zero. However, this is not a fair comparison since it is comparing expected
per-channel-use computational complexity here with expected block length in the variable-block-length case. 

The variable-block-length schemes of Ooi and Wornell [31], [32] achieve linear complexity in the block length 
for the message-communication part. Once complexity is linear in the expected length, it is constant on an average 
per-symbol basis. Thus block codes can also achieve any desired probability of error by adjusting the length of the 
confirm/deny phase in the same way that a large enough terminator d can be chosen for the variable-constraint-length 
convolutional codes of Appendix I. So both have infinite computational error exponents with feedback.

An infinite exponent just means that the asymptotic tradeoff of probability of error with expected per-symbol 
computation is uninteresting when noiseless feedback is allowed. As a result, it is very natural to consider the tradeoff 
with end-to-end delay instead. The open questions that are addressed in this paper are whether the end-to-end delay
performance can generally be improved using feedback, and if so, what are the limits to such improvements. 

III. Main results and examples 

First, some basic definitions are needed. Vector notation $\vec{x}$ is used to denote sequences $x_1^n$ where the indices are
obvious from the context.

Definition 3.1: A discrete time discrete memoryless channel (DMC) is a probabilistic system with an input
and an output. At every time step $t$, it takes an input $x_t \in \mathcal{X}$ and produces an output $y_t \in \mathcal{Y}$ with probability
$\mathcal{P}(Y_t = y | X_t = x) = p_{y|x}$. Both $\mathcal{X}, \mathcal{Y}$ are finite sets and the transition probability matrix $P$ containing the $p_{y|x}$
entries is a stochastic matrix. The current channel output is independent of all past random variables in the system
conditioned on the current channel input.

Following [8, page 94], a DMC is called output-symmetric if the set of outputs $\mathcal{Y}$ can be partitioned into disjoint
subsets$^3$ in such a way that for each subset, the matrix of transition probabilities has the property that each row is
a permutation of each other row and each column is a permutation of each other column.

Definition 3.2: A rate-$R$ encoder $\mathcal{E}$ without feedback is a sequence of maps $\{\mathcal{E}_t\}$. Each $\mathcal{E}_t : \{0, 1\}^{\lfloor R't \rfloor} \to \mathcal{X}$
where the range is the finite set of channel inputs $\mathcal{X}$. The $t$-th map takes as input the available message bits $B_1^{\lfloor R't \rfloor}$
where $R' = \frac{R}{\ln 2}$ is the encoder's rate in bits rather than nats per channel use.

For a rate-$R$ encoder with noiseless feedback, the maps $\mathcal{E}_t : \mathcal{Y}^{t-1} \times \{0, 1\}^{\lfloor R't \rfloor} \to \mathcal{X}$ also get access to all the
past channel outputs $Y_1^{t-1}$.

A delay-$d$ rate-$R$ decoder is a sequence of maps $\{\mathcal{D}_i\}$. Each $\mathcal{D}_i : \mathcal{Y}^{\lceil i/R' \rceil + d} \to \{0, 1\}$ where the output of each
map is the estimate $\hat{B}_i$ for the $i$-th bit. The $i$-th map takes as input the available channel outputs $Y_1^{\lceil i/R' \rceil + d}$. This
means that it can see $d$ time units (channel uses) beyond when the bit to be estimated first had the potential to
influence the channel inputs.

Randomized encoders and decoders also have access to random variables $W_t$ denoting common randomness
available in the system. 

Definition 3.3: The fixed-delay error exponent $\alpha$ is asymptotically achievable at message rate $R$ across a noisy
channel if for every delay $d_j$ in some strictly increasing sequence indexed by $j$ there exist rate-$R$ encoders $\mathcal{E}^j$ and
delay-$d_j$ rate-$R$ decoders $\mathcal{D}^j$ that satisfy the following properties when used with input bits $B_i$ drawn from iid fair
coin tosses.

1) For the $j$-th code, there exists an $\epsilon_j < 1$ so that $\mathcal{P}(B_i \neq \hat{B}_i(d_j)) \leq \epsilon_j$ for every bit position $i \geq 1$. The
$\hat{B}_i(d_j)$ represents the delay-$d_j$ estimate of $B_i$ produced by the $(\mathcal{E}^j, \mathcal{D}^j)$ pair connected through the channel
in question.

2) $\lim_{j \to \infty} \frac{-\ln \epsilon_j}{d_j} \geq \alpha$.

The exponent $\alpha$ is asymptotically achievable universally over delay or in an anytime fashion if a single encoder
$\mathcal{E}$ can be used simultaneously for all sufficiently long delays $d$.



A. Main results 

With these definitions, the five main results of this paper can be stated: 

Theorem 3.1: For a DMC, no fixed-delay exponent greater than the Haroutunian bound ($\alpha > E^+(R)$ from (6))
is asymptotically achievable without feedback.

Theorem 3.2: Uncertainty-focusing bound: For a DMC, no delay exponent $\alpha > E_a(R)$ is asymptotically achievable
even if the encoders are allowed access to noiseless feedback.

$$E_a(R) = \inf_{0 < \lambda < 1} \frac{E^+(\lambda R)}{1 - \lambda}$$ (17)

3 Notice how Gallager's definition of output-symmetric channels slightly generalizes the symmetric channel definitions of Dobrushin [9]
and Csiszár and Körner [6, page 114]. Such output-symmetric channels can be understood as convex combinations of symmetric channels,
each with its own distinct output alphabet. Knowledge of the partition the output lands in just tells the decoder which of the symmetric 
channels it happens to be encountering, but does not reveal anything about the channel input itself. 
where $E^+$ is the Haroutunian bound from (6). Whenever $E^+(R) = E_{sp}(R)$ (e.g. the DMC is output-symmetric),
$E_a(R) = E_{a,s}(R)$ where the latter is expressed parametrically as

$$E_{a,s}(R) = E_0(\eta), \qquad R = \frac{E_0(\eta)}{\eta}$$ (18)

where $E_0(\eta)$ is the Gallager function from (10), and $\eta$ ranges from 0 to $\infty$.

The curve (18) has negative slope of at least $2C / \frac{\partial^2 E_0(0)}{\partial \eta^2}$ in the vicinity of the $(C, 0)$ point where the derivatives
of $E_0$ are taken fixing the capacity-achieving distribution.
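As a numerical illustration of how the two forms of the bound relate, the following sketch evaluates both for a binary symmetric channel. It assumes uniform inputs, the standard Gallager function $E_0(\rho)$ for the BSC, and the sphere-packing exponent in the form $\max_{\rho \geq 0} [E_0(\rho) - \rho R]$; the function names, the grid sizes, and the truncation of $\rho$ to $[0, 5]$ are choices made here for illustration, not taken from the paper.

```python
import math

def gallager_e0(rho, p):
    """Gallager function E_0(rho), in nats, for a BSC(p) with uniform inputs."""
    s = 1.0 / (1.0 + rho)
    return rho * math.log(2.0) - (1.0 + rho) * math.log(p ** s + (1.0 - p) ** s)

def focusing_point(eta, p):
    """Parametric point (rate, exponent) = (E_0(eta)/eta, E_0(eta))."""
    e0 = gallager_e0(eta, p)
    return e0 / eta, e0

def focusing_by_lambda(rate, p, rho_max=5.0, rho_steps=1000, lam_steps=1000):
    """Grid approximation of inf over lambda of E_sp(lambda * rate) / (1 - lambda)."""
    rhos = [rho_max * k / rho_steps for k in range(rho_steps + 1)]
    e0s = [gallager_e0(rho, p) for rho in rhos]
    best = float("inf")
    for k in range(1, lam_steps):
        lam = k / lam_steps
        # Sphere-packing exponent at the reduced rate lam * rate (rho limited to the grid).
        e_sp = max(e0 - rho * lam * rate for rho, e0 in zip(rhos, e0s))
        best = min(best, e_sp / (1.0 - lam))
    return best

if __name__ == "__main__":
    p = 0.02
    for eta in (0.25, 0.5, 1.0, 2.0):
        rate, exponent = focusing_point(eta, p)
        print(eta, rate, exponent, focusing_by_lambda(rate, p))
```

For each $\eta$, the grid search over $\lambda$ should land close to $E_0(\eta)$, which is what the parametric form predicts for an output-symmetric channel.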

Theorem 3.3: For the binary erasure channel with erasure probability $\beta > 0$, there exists a code using noiseless
feedback with a delay error exponent that asymptotically approaches the uncertainty-focusing bound $E_a(R)$ for all
message rates $R < C$. Viewed as a reliability-dependent capacity, the tradeoff is given by

$$C'(\alpha) = \frac{\alpha}{\alpha + \log_2\left(\frac{1 - \beta}{1 - 2^{\alpha}\beta}\right)}$$ (19)

where $\alpha$ is the desired reliability (in base 2) with fixed delay and $C'(\alpha)$ is the supremal rate (in bits per channel
use) at which reliable communication can be sustained with fixed-delay reliability $\alpha$.

Furthermore, for every r > 2 ~ 1 ° ( ^| lo ^f — (in particular: any r > as long as (3 < j^), at all rates R' < 
bits per channel use, the error exponent (in base 2) with respect to delay is > log 2 — 2/3 r . 

Theorem 3.4: For any DMC with strictly positive zero-error capacity $C_{0,f} > 0$, it is possible to asymptotically
approach all delay exponents within the region $\alpha < E_{a,s}(R)$ defined by (18) using noiseless feedback and
randomized encoders, even if the feedback is delayed by a constant $\phi$ channel uses.

This rate/reliability region can also be asymptotically achieved for any DMC by an encoder/decoder pair that has 
access to noiseless feedback if it also has access to an error-free forward communication channel with any strictly 
positive rate. 

Furthermore, the delay exponents can be achieved in a delay-universal or "anytime" sense. 

As is shown in Section VII, the scheme that approaches the uncertainty-focusing bound is built around a variable-
length channel code with the zero-error aspects used to convey unambiguous "punctuation" information that allows 
the decoder to stay synchronized with the encoder. Without any zero-error capacity, this punctuation information 
can be encoded in a separate parallel stream of channel uses to give the following result. 

Theorem 3.5: For any DMC, it is possible with noiseless feedback and randomized encoders to asymptotically
achieve all delay exponents $\alpha < E'(R)$ where the tradeoff curve is given parametrically by varying $\rho \in (0, \infty)$:

E ' {P) = (20) 

R(p) = — 

The curve © has strictly negative slope -E (1)/(C - (^§^)) in the vicinity of the (C,0) point. 
Furthermore, these delay exponents are also achievable in a delay-universal or "anytime" sense. 

The fact that this achievable region (20) generically has strictly negative slope in the vicinity of $(C, 0)$ while the
Haroutunian bound $E^+$ and sphere-packing bound $E_{sp}$ both generically approach $(C, 0)$ only quadratically with
zero slope establishes that noiseless feedback generally improves the tradeoff between end-to-end delay and the 
probability of error. 

The above results relate to the strict interior of the region defined by $E_{a,s}(R)$ or $E'(R)$ for achievability and the
strict exterior region corresponding to $E_a(R)$ for the converse. Unlike the case of fixed-block-length codes where
the sphere-packing bound is known to be achievable at high rates, the results above do not cover points on the
$E_{a,s}(R)$ curve itself at any rates.

The results of Theorems 3.3, 3.4, and 3.5 are also stated using asymptotic language: they apply in the limit
of large end-to-end delays. In the case of Theorems 3.4 and 3.5, the parameters defining the randomized codes are
also allowed to get asymptotically large in order to approach the delay-error-exponent frontier. However, the proofs
use techniques that make it possible to evaluate the performance of schemes with finite parameters. 

B. Numerical examples 
Fig. 7. The binary erasure channel with $\beta = 0.4$. The vertical dashed line represents the rate of $\frac{1}{2}$ bits per channel use while the horizontal
dashed line is the ultimate limit of $-\ln(0.4)$ for the reliability function. Notice how the uncertainty-focusing bound gets very close to that
ultimate bound even at moderately small rates. The triangle illustrates the "inverse concatenation construction" connecting the two bounds
to each other.

The erasure channel is the simplest channel for understanding the asymptotic tradeoffs between message rate, 
end-to-end delay, and probability of error when noiseless feedback is allowed. Figure 7 illustrates how when the
erasure probability $\beta$ is small, even moderately low rates achieve spectacular reliabilities with respect to fixed delay.

Now, consider a binary symmetric channel with crossover probability 0.02. The capacity of this channel is 
about 0.60 nats per channel use. Figure 8 shows how the different choices of $\lambda$ used in the bound (17) kiss the
uncertainty-focusing bound for the BSC. It also shows the Burnashev bound for variable-block-length coding for
comparison. In this particular plot, the Burnashev bound appears to always be higher than the uncertainty-focusing
bound. Figure 9 illustrates that this is not always the case by plotting both bounds in the high-rate regime for a BSC
with crossover probability 0.003. It is unknown whether any scheme can actually achieve fixed-delay reliabilities
above the Burnashev bound since the scheme of Theorem 3.5 does not do so.
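The capacity figure quoted for the BSC can be checked with a short computation. This is a minimal sketch using the standard formula $C = \ln 2 - H_b(p)$ in nats; the function names are chosen here for illustration.

```python
import math

def binary_entropy_nats(p):
    """Binary entropy H_b(p) in nats, with H_b(0) = H_b(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def bsc_capacity_nats(p):
    """Capacity of a binary symmetric channel with crossover probability p, in nats per channel use."""
    return math.log(2.0) - binary_entropy_nats(p)

if __name__ == "__main__":
    for crossover in (0.02, 0.003):
        print(crossover, bsc_capacity_nats(crossover))
```

With crossover probability 0.02 this evaluates to roughly 0.595 nats per channel use, consistent with the "about 0.60" stated above.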

The gap between the uncertainty-focusing bound and the scheme of Theorem 3.5 is illustrated in Figure 10 for
the BSC. This also shows how the sphere-packing bound is significantly beaten at high rates even when the channel
has no zero-error capacity and thus provides an explicit counterexample to Pinsker's Theorem 8 in [1]. Examples
showing how the uncertainty-focusing bound is met for communication systems with strictly positive zero-error
capacity are deferred to Section VII-F.

IV. Upper-bounding the fixed-delay reliability function without feedback 

This section proves Theorem 3.1, giving a generalization of Pinsker's BSC argument from [1] to the case of
general DMCs. The Haroutunian exponent $E^+(R)$ from (6) is shown to upper bound the reliability function with
delay if feedback is not available. For output-symmetric DMCs, this is the same as the sphere-packing bound
$E_{sp}(R)$. Furthermore, we discuss why this proof does not go through when feedback is present.

The complete proof spans the next few sections with some technical details in the Appendices. 



Fig. 8. The sphere-packing, uncertainty-focusing, and Burnashev bounds for a BSC with crossover probability 0.02, plotted against rate (in nats
per channel use). The thin lines represent the parametric bounds in (17) for three fixed choices of $\lambda$, and the thick middle curve is the
uncertainty-focusing bound, the lower envelope of the parametric bounds over all $\lambda$.
Fig. 9. The uncertainty-focusing and Burnashev bounds for a BSC with crossover probability 0.003, plotted against rate (in nats).




Fig. 10. The delay-reliability bounds for the BSC used with noiseless feedback. The sphere-packing bound approaches capacity in a
quadratically flat manner while the uncertainty-focusing bound and the scheme of Theorem 3.5 both approach the capacity point linearly,
albeit with different slopes. The random-coding error exponent is also plotted for convenience since it beats the scheme of Theorem 3.5 at
low rates and can be attained without feedback.
A. Feedforward decoders and their equivalent forms

For notational convenience, assume that R' < 1 so that at least one channel use comes between each message 
bit's arrival. If R' > 1, the same argument will work (at the cost of uglier notation) by considering the incoming 
bits to arrive in pairs, triples, etc. Theorem 3.1 is proven by considering a more powerful class of decoders that
have access to extra information that can only improve their performance. 

Definition 4.1: A delay-$d$ rate-$R$ decoder $\mathcal{D}$ with feedforward information is a decoder $\mathcal{D}_i : \{0, 1\}^{i-1} \times \mathcal{Y}^{\lceil i/R' \rceil + d} \to \{0, 1\}$
that has noiseless access to the past message bits $B_1^{i-1}$ in addition to the available channel outputs $Y_1^{\lceil i/R' \rceil + d}$.

The first property is that with access to the feedforward information, it suffices to ignore very old channel outputs.

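Lemma 4.1: For a memoryless channel, given a rate-$R$ encoder $\mathcal{E}$ without feedback and a delay-$d$ rate-$R$ decoder
$\mathcal{D}_i$ with feedforward for bit $i$, there exists a decoder $\mathcal{D}_i^f : \{0, 1\}^{i-1} \times \mathcal{Y}^{d+1} \to \{0, 1\}$ for bit $i$ that only depends
on all the past message bits $B_1^{i-1}$ and the recent channel outputs $Y_{\lceil i/R' \rceil}^{\lceil i/R' \rceil + d}$. The bit error probability satisfies

$$\mathcal{P}\left(B_i \neq \mathcal{D}_i^f(B_1^{i-1}, Y_{\lceil i/R' \rceil}^{\lceil i/R' \rceil + d})\right) \leq \mathcal{P}\left(B_i \neq \mathcal{D}_i(B_1^{i-1}, Y_1^{\lceil i/R' \rceil + d})\right)$$

assuming that the message bits $B$ are all iid fair coin tosses.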

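Proof: The result follows immediately from the following Markov chain that holds since there is no feedback.

$$Y_1^{\lceil i/R' \rceil - 1} - X_1^{\lceil i/R' \rceil - 1} - B_1^{i-1} - \left(B_i^{i + \lceil dR' \rceil},\; X_{\lceil i/R' \rceil}^{\lceil i/R' \rceil + d},\; Y_{\lceil i/R' \rceil}^{\lceil i/R' \rceil + d}\right).$$ (21)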

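To see the result explicitly, let $\mathcal{D}_i^{map}$ be the MAP decoder for bit $i$ based on feedforward information $B_1^{i-1}$ and
observations $Y_1^{\lceil i/R' \rceil + d}$. Writing $\tau = \lceil i/R' \rceil$ and $k = i + \lceil dR' \rceil$ for brevity,

$$
\begin{aligned}
\mathcal{D}_i^{map}(b_1^{i-1}, y_1^{\tau + d})
&= \arg\max_{b_i} P(B_i = b_i \mid B_1^{i-1} = b_1^{i-1}, Y_1^{\tau + d} = y_1^{\tau + d}) \\
&= \arg\max_{b_i} \sum_{b_{i+1}^{k}} P(B_i = b_i, B_1^{i-1} = b_1^{i-1}, B_{i+1}^{k} = b_{i+1}^{k} \mid Y_1^{\tau + d} = y_1^{\tau + d}) \\
&= \arg\max_{b_i} \sum_{b_{i+1}^{k}} P(B_1^{i-1} = b_1^{i-1}, Y_1^{\tau - 1} = y_1^{\tau - 1}) \, P(B_i^{k} = b_i^{k}) \, P(Y_{\tau}^{\tau + d} = y_{\tau}^{\tau + d} \mid B_1^{k} = b_1^{k}, Y_1^{\tau - 1} = y_1^{\tau - 1}) \\
&=_{(a)} \arg\max_{b_i} \sum_{b_{i+1}^{k}} P(Y_{\tau}^{\tau + d} = y_{\tau}^{\tau + d} \mid B_1^{k} = b_1^{k}, Y_1^{\tau - 1} = y_1^{\tau - 1}) \\
&=_{(b)} \arg\max_{b_i} \sum_{b_{i+1}^{k}} P(Y_{\tau}^{\tau + d} = y_{\tau}^{\tau + d} \mid X_1^{\tau + d} = \mathcal{E}(b_1^{k}), B_1^{k} = b_1^{k}, Y_1^{\tau - 1} = y_1^{\tau - 1}) \\
&=_{(c)} \arg\max_{b_i} \sum_{b_{i+1}^{k}} P(Y_{\tau}^{\tau + d} = y_{\tau}^{\tau + d} \mid X_{\tau}^{\tau + d} = \mathcal{E}(b_1^{k})).
\end{aligned}
$$

The first few lines above are standard expansions of probability in the MAP context and use the fact that the
message bits are drawn iid. (a) holds by dropping terms that do not depend on the exact values for $b_i^{k}$ and
thus do not impact the argmax. (b) uses the fact that the channel input $X$ is entirely determined$^4$ by the message
bits $B$ for an encoder without feedback. (c) is due to the memoryless nature of the channel.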

4 Note that the same argument would also work if the encoder and decoder are allowed to share common randomness. 
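Define $\mathcal{D}_i^f$ directly as

$$\mathcal{D}_i^f\left(b_1^{i-1}, y_{\lceil i/R' \rceil}^{\lceil i/R' \rceil + d}\right) = \arg\max_{b_i} \sum_{b_{i+1}^{i + \lceil dR' \rceil}} P\left(Y_{\lceil i/R' \rceil}^{\lceil i/R' \rceil + d} = y_{\lceil i/R' \rceil}^{\lceil i/R' \rceil + d} \,\middle|\, X_{\lceil i/R' \rceil}^{\lceil i/R' \rceil + d} = \mathcal{E}(b_1^{i + \lceil dR' \rceil})\right).$$

This decoder only depends on the recent channel outputs in addition to the feedforward information and achieves
MAP performance. Since MAP is optimal, the probability of bit error would be the same or better than any other
decoder.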



The second property is that it suffices to feedforward the error sequence $\tilde{B}_i = B_i + \hat{B}_i \mod 2$ rather than the
past message bits themselves.

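Lemma 4.2: Given a rate-$R$ encoder $\mathcal{E}$ and delay-$d$ rate-$R$ decoder $\mathcal{D}_i$ for bit $i$ with feedforward, there exists
another decoder $\tilde{\mathcal{D}}_i : \{0, 1\}^{i-1} \times \mathcal{Y}^{\lceil i/R' \rceil + d} \to \{0, 1\}$ that only depends on the error sequence $\tilde{B}_1^{i-1}$ in addition
to the channel outputs $Y_1^{\lceil i/R' \rceil + d}$. If $\tilde{B}_i = \hat{B}_i + B_i \mod 2$, then the outputs of the two decoders are identical:

$$\tilde{\mathcal{D}}_i(\tilde{B}_1^{i-1}, Y_1^{\lceil i/R' \rceil + d}) = \mathcal{D}_i(B_1^{i-1}, Y_1^{\lceil i/R' \rceil + d}).$$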
Proof: This holds very generally by induction. Neither memorylessness nor even the absence of feedback is required.
It clearly holds for $i = 1$ since there are no prior bits and so the same $\hat{B}_1$ results. Assume now that it holds for
all $j < k$ and consider $i = k$. By the induction hypothesis, the action of all the prior decoders $j$ can be simulated
since the decoder has access to $\tilde{B}_1^{j-1}$ and $Y_1^{\lceil j/R' \rceil + d}$. The resulting estimates $\hat{B}_j$ for $j < k$ can be XORed with $\tilde{B}_j$
to recover $B_j$ itself. Since $B_1^{k-1}$ can be recovered from the given information, the original decoder can be run
as a subroutine to give $\hat{B}_k$.

Lemmas 4.1 and 4.2 tell us that feedforward decoders can be thought of in three ways: having access to all past
message bits and all past channel outputs, having access to all past message bits and only a recent window of past
channel outputs, or having access to all past decoding errors and all past channel outputs.
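The induction behind the error-sequence form of the feedforward decoder can be illustrated with a small simulation. This is only a sketch: `toy_decoder` is a hypothetical stand-in for an arbitrary deterministic feedforward decoder, bit positions are 0-indexed, and the window rule (two channel uses per bit plus a delay) is an assumption made for the demo.

```python
import random

def toy_decoder(i, past_bits, outputs):
    """Hypothetical feedforward decoder for bit i; any deterministic rule would do."""
    return (i + sum(past_bits) + sum(outputs)) % 2

def decode_from_errors(decoder, i, past_errors, outputs, window):
    """Simulate decoder i using the past error bits instead of the past message bits."""
    recovered = []
    for j in range(i):
        # Re-run the earlier decoder, then XOR its estimate with the error bit
        # to recover the true message bit j.
        estimate = decoder(j, recovered, outputs[: window(j)])
        recovered.append(estimate ^ past_errors[j])
    return decoder(i, recovered, outputs[: window(i)])

def run_demo(num_bits=20, delay=3, uses_per_bit=2, seed=1):
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(num_bits)]
    outputs = [rng.randint(0, 1) for _ in range(uses_per_bit * num_bits + delay)]
    window = lambda j: uses_per_bit * (j + 1) + delay  # outputs visible to decoder j
    estimates = [toy_decoder(j, bits[:j], outputs[: window(j)]) for j in range(num_bits)]
    errors = [b ^ e for b, e in zip(bits, estimates)]
    simulated = [decode_from_errors(toy_decoder, j, errors[:j], outputs, window)
                 for j in range(num_bits)]
    return estimates, simulated

if __name__ == "__main__":
    estimates, simulated = run_demo()
    print(estimates == simulated)
```

The channel outputs here are arbitrary random bits, which reflects the remark in the proof that neither memorylessness nor the absence of feedback is needed for this equivalence.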

B. Constructing a rate-$(R - \delta_1)$ block code

Consider the system illustrated in Figure 11. The message bitstream consisting of fair coin tosses is encoded using
the given rate-$R$ encoder. The channel outputs are decoded using the delay-$d$ rate-$R$ decoders with feedforward,
with the feedforward in the form of the error signals $\tilde{B}$ by Lemma 4.2. These error signals are generated by XORing
the message bits with the output of an equivalent feedforward decoder. Finally, the feedforward error signals are
used one more time and combined with the estimates $\hat{B}$ to recover the message bits $B$ exactly. It is immediately
clear that this hypothetical system never makes an error from end to end.



Fig. 11. The relevant "cutset" illustrated. For the message bits $B$ to pass noiselessly across the cutset, the sum of the mutual information
between $X$ and $Y$ and the entropy of $\tilde{B}$ must be larger than the entropy of the $B$. The mutual information between $X$ and $Y$ is bounded
by the capacity of the noisy channel and the entropy of $\tilde{B}$ provides a lower bound to the probability of bit errors.

Now, this system will be interpreted as a block code. Pick an arbitrarily small $\delta_1 > 0$. To avoid cumbersome
notation, some integer effects will be neglected. For every delay $d$, pick a block length $n = \frac{dR}{\delta_1}$. For notational
convenience, let $\delta_1'$ be such that $\frac{R - \delta_1}{\ln 2} = R' - \delta_1'$ so that $n(R' - \delta_1') = nR' - dR'$.
The data processing inequality implies:

Lemma 4.3: Suppose $n$ is the block length, the block rate is $R - \delta_1$ nats per channel use, the $X_1^n$ are the channel
inputs, the $Y_1^n$ are the channel outputs, and the $\tilde{B}_1^{nR'}$ are the error signals coming from the underlying rate-$R$
delay-$d$ encoding and decoding system. Then

$$H(\tilde{B}_1^{n(R' - \delta_1')}) \geq n(R - \delta_1) - I(X_1^n; Y_1^n).$$ (22)

Proof: See Appendix II-A.

C. Lower-bounding the error probability 

Now, suppose this system of Figure 11 were to be run over the noisy channel $G$ that minimizes (6) at $R - 2\delta_1$
nats per channel use. Since the capacity of $G$ is at most $R - 2\delta_1$ nats per channel use and there is no feedback to
the encoder, the mutual information between the channel inputs and outputs is upper-bounded by

$$I(X_1^n; Y_1^n) \leq n(R - 2\delta_1) = n(R - \delta_1) - n\delta_1.$$ (23)

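Plugging (23) into (22) from Lemma 4.3 gives

$$H(\tilde{B}_1^{n(R' - \delta_1')}) \geq n\delta_1.$$ (24)

Since the sum of marginal entropies $\sum_{i=1}^{n(R' - \delta_1')} H(\tilde{B}_i) \geq H(\tilde{B}_1^{n(R' - \delta_1')})$, the average entropy of the error bits $\tilde{B}_i$
is at least $\frac{\delta_1}{R' - \delta_1'} > 0$. Consider $i^*$ whose individual entropy $H(\tilde{B}_{i^*}) \geq \frac{\delta_1}{R' - \delta_1'}$.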

By the strict monotonicity of the binary entropy function for probabilities less than $\frac{1}{2}$, there exists a $\delta_2 > 0$ so
that the probability of bit error $\mathcal{P}(\tilde{B}_{i^*} = 1) = \mathcal{P}(\hat{B}_{i^*} \neq B_{i^*}) \geq \delta_2$. While the specific positions $i^*$ might vary for
different delays $d$, the lower bound $\delta_2$ on minimum error probability does not vary.
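The constant $\delta_2$ can be computed explicitly by inverting the binary entropy function on $[0, \frac{1}{2}]$. The sketch below does this by bisection, working in nats; the function names and the tolerance are illustrative choices, and the entropy threshold passed in is assumed to lie between 0 and $\ln 2$.

```python
import math

def binary_entropy_nats(p):
    """Binary entropy H_b(p) in nats, with H_b(0) = H_b(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def min_error_probability(entropy_nats, tol=1e-12):
    """Smallest p in [0, 1/2] whose binary entropy is at least entropy_nats."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # H_b is strictly increasing on [0, 1/2], so bisection applies.
        if binary_entropy_nats(mid) < entropy_nats:
            lo = mid
        else:
            hi = mid
    return hi

if __name__ == "__main__":
    print(min_error_probability(0.1))
```

Any bit whose error indicator has entropy at least the given threshold must be in error with probability at least the returned value, whichever side of $\frac{1}{2}$ that probability falls on.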

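At this point, Lemma 4.1 implies that even if the channel $G$ were used only for the $d + 1$ time steps from
$[\lceil i^*/R' \rceil, \lceil i^*/R' \rceil + d]$, the same minimum error probability $\delta_2$ must hold, regardless of how large $d$ is. For each possible
message prefix $b_1^{i^*}$, there is an error event $A(b_1^{i^*})$ corresponding to the channel outputs that would cause erroneous
decoding of the $i^*$-th bit. Formally,

$$A(b_1^{i^*}) := \left\{ y_{\lceil i^*/R' \rceil}^{\lceil i^*/R' \rceil + d} \;\middle|\; \mathcal{D}_{i^*}^f\left(b_1^{i^* - 1}, y_{\lceil i^*/R' \rceil}^{\lceil i^*/R' \rceil + d}\right) \neq b_{i^*} \right\}.$$

Averaging out the probability of error over message prefixes gives

$$\delta_2 \leq \sum_{b_1^{\lfloor (\lceil i^*/R' \rceil + d) R' \rfloor}} 2^{-\lfloor (\lceil i^*/R' \rceil + d) R' \rfloor} \, \mathcal{P}\left(A(b_1^{i^*}) \,\middle|\, \vec{X} = \mathcal{E}(b_1^{\lfloor (\lceil i^*/R' \rceil + d) R' \rfloor})\right).$$

Since the average over messages $b_1^{\lfloor (\lceil i^*/R' \rceil + d) R' \rfloor}$ is at least $\delta_2$, and the probabilities can be no bigger than 1 and no
smaller than 0, at least a $\frac{\delta_2}{2}$ proportion of messages result in the $A(b_1^{i^*})$ having a conditional probability of at least
$\frac{\delta_2}{2}$ if channel $G$ is used.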
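The counting step in the last sentence can be spelled out as follows. Here $f$ is a symbol introduced only for this sketch: it denotes the fraction of messages whose error event has conditional probability at least $\frac{\delta_2}{2}$ under channel $G$. Bounding the probability by 1 on that fraction and by $\frac{\delta_2}{2}$ on the rest gives

```latex
\begin{align*}
\delta_2 \;\le\; f \cdot 1 + (1 - f)\cdot\frac{\delta_2}{2} \;\le\; f + \frac{\delta_2}{2}
\quad\Longrightarrow\quad f \;\ge\; \frac{\delta_2}{2}.
\end{align*}
```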

All that remains is to show that the probability of this event under the true channel $P$ cannot be too small. To
distinguish between the probability of an event when using channel $P$ or channel $G$, subscripts are used with $\mathcal{P}_P$
used to refer to the probability of an event when the channel is $P$ and $\mathcal{P}_G$ used for when the channel is $G$.

This simple lemma is useful: 

Lemma 4.4: If under channel $G$ and input sequence $\vec{x}$, the probability $\mathcal{P}_G(Y_1^d \in A | X_1^d = \vec{x}) \geq \delta > 0$, then for
any $\epsilon > 0$, there exists $d_0(\epsilon, \delta, G, P)$ so that as long as $d > d_0(\epsilon, \delta, G, P)$, the $A$ event's conditional probability
using channel $P$ must satisfy $\mathcal{P}_P(Y_1^d \in A | X_1^d = \vec{x}) \geq \frac{\delta}{2} \exp(-d(D(G\|P|\vec{r}) + \epsilon))$ where $\vec{r}$ is the type of $\vec{x}$.
Proof: See Appendix II-B.
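Given $G$ and an arbitrary $\epsilon > 0$, apply Lemma 4.4 to consider delays $d > d_0$. This reveals that

$$
\begin{aligned}
\mathcal{P}_P(\hat{B}_{i^*} \neq B_{i^*}) &= \sum_{b_1^{\lfloor (\lceil i^*/R' \rceil + d) R' \rfloor}} 2^{-\lfloor (\lceil i^*/R' \rceil + d) R' \rfloor} \, \mathcal{P}_P\left(A(b_1^{i^*}) \,\middle|\, \vec{X} = \mathcal{E}(b_1^{\lfloor (\lceil i^*/R' \rceil + d) R' \rfloor})\right) \\
&\geq_{(a)} \frac{\delta_2}{2} \cdot \frac{\delta_2}{4} \exp\left(-(d + 1)\left(\max_{\vec{r}} D(G\|P|\vec{r}) + \epsilon\right)\right) \\
&=_{(b)} \frac{\delta_2^2}{8} \exp\left(-(d + 1)\left(E^+(R - 2\delta_1) + \epsilon\right)\right).
\end{aligned}
$$

(a) follows from the fact that a proportion $\frac{\delta_2}{2}$ of the messages must have probability of bit error of at least $\frac{\delta_2}{2}$ with
the final factor of 2 coming from Lemma 4.4. Since the local type of the channel input is unknown, the maximum
is taken over the channel input type $\vec{r}$. (b) is using the definition of $G$ and the Haroutunian bound.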

Since $\epsilon > 0$ is an arbitrary choice and $\delta_2$ does not depend on the delay $d$, taking logs quickly reveals that the
error exponent with delay cannot be any larger than $E^+(R - 2\delta_1)$. For any $\alpha > E^+(R)$, it is always possible to
pick a small enough $\delta_1 > 0$ so that $\alpha > E^+(R - 2\delta_1)$ as well since the Haroutunian bound $E^+$ is continuous in
the rate for all rates strictly below Shannon capacity and above the feedback zero-error capacity $C_{0,f}$. Thus, no
exponent $\alpha > E^+(R)$ can be asymptotically achieved and Theorem 3.1 is proved. □



D. Comments 

For output-symmetric channels, $E^+(R) = E_{sp}(R)$ and so the usual sphere-packing bound is recovered in the
fixed-delay context. Since $E_{sp}(R)$ is achieved universally with delay at high rates by using infinite-length random
time-varying convolutional codes, this means that such codes achieve the best possible asymptotic tradeoff between
probability of bit error and end-to-end delay. However, the proof in the previous section does not get to the
sphere-packing bound for asymmetric channels like the Z-channel plotted in Figure 5.

We could apply the sphere-packing bound to the $n$-length block-code by trying the $G$ channel that optimizes
$E_{sp}(R - 2\delta_1 - \gamma)$ for one of the block codeword compositions $\vec{r}$ that contains at least $\exp(n(R - \delta_1 - \gamma))$ codewords
for some small $\gamma > 0$ that can be chosen after $\delta_1$. As a result, there would be weak bits whose probabilities of
error are at least $\delta_2$ when used with the $G$ channel. The problem arises when we attempt to translate this back to
the original channel $P$. Because the local $(d + 1)$-length input-type is unknown in the vicinity of these weak bits,
we would only be able to prove an exponent of

$$\tilde{E}^+(R) = \inf_{G :\, \max_{\vec{r} : I(\vec{r}, P) \geq R} I(\vec{r}, G) < R} \; \max_x D(G(\cdot|x) \| P(\cdot|x)).$$ (25)

This is formally better than (6) since there is slightly more flexibility in choosing the mimicking channel $G$. It
now just has to have a mutual information across it lower than $R$ when driven with an input distribution that is
good enough for the original channel. But, it seems unlikely that (25) is tight the way that $E_{sp}$ is since for the
Z-channel, it can evaluate to the same thing as (6).

It is more interesting to reflect upon why this proof does not go through when feedback is available. This reveals 
why Pinsker's assertion of Theorem 8 in [1] is incorrect. Although the lack of feedback was used in many places,
the most critical point is Lemma 4.1, which corresponds to [1, Eqn. (39)]. When feedback is present, the current
channel inputs can depend on the past channel outputs, even if we condition on the past channel inputs. Thus it is
not possible to take a block error and then focus attention on the channel behavior only during the delay period. It
could be that the atypical channel behavior has to begin well before the bit in question even arrived at the encoder.
This is seen clearly in the BEC case with feedback discussed in Section II-A: the most common failure mode is
for a bit to enter finding a large queue of senior bits already waiting and then finding that service continues to be
so slow that the senior bits are not all able to leave the queue before the bit's own deadline expires.

V. Upper-bounding the fixed-delay reliability function with feedback 



To prove Theorem 3.2 and get a proper upper bound to the fixed-delay reliability function when feedback is
allowed, we need to account for the fact that the dominant error event might begin before the bit in question even
arrives at the encoder. To do this, Viterbi's argument from [20] is repurposed to address delay rather than constraint
length. We call this upper bound the "uncertainty-focusing bound" because it is based on the idea of focusing the 
decoder's uncertainty about the message bits given the channel outputs onto bits whose deadlines are not pending. 

To bound what is possible, a fixed-delay code is translated into a fixed-block-length code. A lower bound on 
error probability for block codes is then pulled back to give a bound on the probability of error for the original 
fixed-delay code. The key difference from the previous section is that the block-length n is not automatically made 
large compared to the delay. Rather, each different block length provides its own bound at all rates, with the final 
bound at any given rate and delay coming from optimizing over the block length. 

Proof: Given a code with fixed delay $d$, pick an arbitrary $0 < \lambda < 1$ and set the block length $n = \frac{d}{1 - \lambda}$. As
illustrated in Figure 12, this implies that $n = \lambda n + d = \frac{\lambda}{1 - \lambda} d + d$. To avoid cumbersome notation, integer effects
are ignored here. When $d$ is small, the fact that the block length must be an integer limits our choices for $\lambda$ in an
insignificant way.

The block decoder operates by running the delay-$d$ decoder. This decodes the first $\lambda n R'$ bits, thus making the
effective rate for the block code $\lambda R'$ bits per channel use or $\lambda R$ nats per channel use. The encoder just applies the
given causal encoders with feedback using the actual message bits as the first $\lambda n R'$ bits. Random coin tosses can
be used for the final $(1 - \lambda) n R'$ inputs to the encoders since these will not be decoded anyway.

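[Figure 12: a block of $n$ channel uses split into $\lambda n$ uses labeled "past channel behavior" and $(1 - \lambda) n$ uses labeled "future behavior"; the
first $\lambda R' n$ bits are those whose deadline is within the block, and the remaining bits are ignored.]

Fig. 12. Using the fixed-delay code to make a block code of length $n$: only the first $\lambda R' n$ bits are decoded by the end of the block and
so the rate is cut by a factor of $\lambda$. The error exponent with block length $n$ is $1 - \lambda$ of the exponent with the delay $d$.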

Let $B_1^{\lambda n R'}$ be the original message consisting entirely of independent fair coin tosses. The Haroutunian bound
reveals that given any $\delta_1, \epsilon > 0$ there exists a sufficiently large block length $n_1$ and a constant $K$, so that as long
as $n > n_1$, this fixed-block-length code with feedback must have a probability of block error that is lower bounded
by [7]

$$\mathcal{P}_P(\hat{B}_1^{\lambda n R'} \neq B_1^{\lambda n R'}) \geq K \exp\left(-n[E^+(\lambda R - \delta_1) + \epsilon]\right).$$ (26)

Substitute in $n = \frac{d}{1 - \lambda}$ and then notice that there must be at least one message bit position $i^*$ whose probability
of bit error is at least $\frac{1}{\lambda n R'}$ times the probability of block error. This gives

$$\mathcal{P}_P(\hat{B}_{i^*} \neq B_{i^*}) \geq \frac{(1 - \lambda) K}{\lambda R' d} \exp\left(-d\left[\frac{E^+(\lambda R - \delta_1)}{1 - \lambda} + \frac{\epsilon}{1 - \lambda}\right]\right).$$

Since the $\frac{1}{d}$ term in front is dominated by the exponential and $\delta_1, \epsilon$ are arbitrarily small and $\lambda$ was arbitrary,
taking logs and the limit $d \to \infty$ proves (17).

Whenever $E^+(R) = E_{sp}(R)$, by using © and following arguments identical to those used in the analysis of
convolutional codes, (17) turns into (18). These arguments are given in Appendix II-C for completeness.

Expanding (18) by Taylor expansion in the vicinity of $\eta = 0$, noticing that the first derivative of $E_0$ there is the
capacity $C$, and applying simple algebra leads to the negative slope of $2C / \frac{\partial^2 E_0(0)}{\partial \eta^2}$ in the vicinity of the $(C, 0)$
point. When the second derivative term is equal to zero, then [8] reveals that the channel's sphere-packing bound
hits $(C, 0)$ at a positive slope of at least $-1$ and thus (18) evaluated at $\eta = 1$ already has hit the capacity. There is
no need to consider lower values of $\eta$. The uncertainty-focusing bound in such cases jumps discontinuously down
to zero at rates above capacity. □
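The Taylor-expansion step can be written out as follows. This is a sketch under the assumption that $E_0$ is twice differentiable at 0 with a strictly negative second derivative; $E_0''(0)$ is shorthand introduced here for $\frac{\partial^2 E_0(0)}{\partial \eta^2}$.

```latex
\begin{align*}
E_0(\eta) &\approx C\eta + \tfrac{1}{2} E_0''(0)\,\eta^2, \\
R = \frac{E_0(\eta)}{\eta} &\approx C + \tfrac{1}{2} E_0''(0)\,\eta
  \quad\Longrightarrow\quad \eta \approx \frac{2(R - C)}{E_0''(0)}, \\
E_{a,s}(R) = E_0(\eta) &\approx C\eta \approx \frac{2C}{E_0''(0)}\,(R - C).
\end{align*}
```

Since $E_0''(0) < 0$, the coefficient $2C / E_0''(0)$ is negative, which gives a linear approach to the $(C, 0)$ point.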
It is also important to notice that the core idea driving the proof is the inverse-concatenation construction from
[20] and [22]. This allows us to map an upper bound on the fixed-block-length reliability function into an upper 
bound on the fixed-delay reliability. As a result, the uncertainty-focusing bound can also be used for channels 
without feedback. 

Corollary 5.1: For a DMC, no fixed-delay exponent greater than the expurgated bound at rate 0 ($\alpha > E_{ex}(0)$
from [8]) is asymptotically achievable without feedback.

Proof: Because the straight-line bound [8] can tighten the low-rate exponent for block-codes without feedback, this
means that it can also be used to tighten the bound for fixed-delay codes in the low-rate regime. The inverse
concatenation construction immediately turns the straight-line bound for fixed-block-length codes into a
horizontal line at $E_{ex}(0)$ for fixed-delay codes. □

Thus the best upper bound we have for the reliability function for end-to-end delay in a system without feedback
is $\min(E_{ex}(0), E^+(R))$.

For the case of output-symmetric channels with feedback (or whenever the $E_{a,s}$ bound is tight), it is also possible
to explicitly calculate the worst case $\lambda^*$ in parametric form using the arguments of Appendix II-C:

$$\lambda^* = \frac{\eta}{E_0(\eta)} \frac{\partial E_0(\rho = \eta)}{\partial \rho} = \frac{1}{R(\eta)} \frac{\partial E_0(\rho = \eta)}{\partial \rho}.$$ (27)

The exponentially dominating error event involves $\frac{\lambda^*}{1 - \lambda^*} d$ of the past channel outputs as well as the $d$ time steps in
the future, for an error event length of $\frac{d}{1 - \lambda^*}$. Thus $\lambda^*$ captures the critical balance between how badly the channel
must misbehave and how long it must misbehave for. In general, when $R$ is near $C$, the $\eta$ will be near zero. Since
$\frac{\partial E_0(\rho = 0)}{\partial \rho} = C$, this implies $\lambda^*$ there will be near 1, and the dominant error events will be much longer than the
desired end-to-end delay.
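The behavior of the worst-case $\lambda^*$ can be explored numerically for a BSC. The sketch below assumes uniform inputs and the standard BSC Gallager function, evaluates $\lambda^* = \eta \, E_0'(\eta) / E_0(\eta)$, and approximates the derivative by a central difference; the function names and the step size are illustrative choices.

```python
import math

def gallager_e0(rho, p):
    """Gallager function E_0(rho), in nats, for a BSC(p) with uniform inputs."""
    s = 1.0 / (1.0 + rho)
    return rho * math.log(2.0) - (1.0 + rho) * math.log(p ** s + (1.0 - p) ** s)

def worst_case_lambda(eta, p, h=1e-6):
    """lambda* = eta * E_0'(eta) / E_0(eta), using a central-difference derivative."""
    derivative = (gallager_e0(eta + h, p) - gallager_e0(eta - h, p)) / (2.0 * h)
    return eta * derivative / gallager_e0(eta, p)

if __name__ == "__main__":
    p = 0.02
    for eta in (0.01, 0.1, 0.5, 1.0, 2.0):
        rate = gallager_e0(eta, p) / eta
        print(eta, rate, worst_case_lambda(eta, p))
```

For small $\eta$ (rates near capacity) the computed $\lambda^*$ is close to 1, matching the observation that the dominant error events then extend far into the past relative to the delay $d$.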

VI. ACHIEVABILITY OF THE FIXED-DELAY RELIABILITY WITH FEEDBACK FOR ERASURE CHANNELS 

This section proves Theorem 3.3 and thereby demonstrates the asymptotic achievability of $E_a(R) = E_{a,s}(R)$
everywhere for erasure channels with noiseless feedback.



A. The optimal code and its reliability 

The optimal scheme for the binary erasure channel with instantaneous$^5$ causal noiseless feedback is intuitively
obvious: buffer up message bits as they arrive and attempt to transmit the oldest message bit that has not yet
been received correctly by the receiver. What is not immediately obvious is how well this scheme actually performs 
with end-to-end delay. 
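A quick way to get a feel for the delay behavior of this scheme is to simulate it. The sketch below is illustrative only: it assumes one message bit arrives every `uses_per_bit` channel uses (so `uses_per_bit=2` models rate $\frac{1}{2}$), that a newly arrived bit may be sent in the same channel use, and that delay is measured as delivery time minus arrival time. These modeling choices and all names are assumptions made here.

```python
import random
from collections import deque

def simulate_repeat_until_received(beta, uses_per_bit, num_bits, seed=0):
    """Simulate oldest-bit-first retransmission over a BEC with erasure probability beta."""
    rng = random.Random(seed)
    waiting = deque()  # arrival times of bits not yet received
    delays = []
    arrived = 0
    t = 0
    while len(delays) < num_bits:
        t += 1
        if arrived < num_bits and t % uses_per_bit == 0:
            waiting.append(t)
            arrived += 1
        # Send the oldest waiting bit; it gets through unless the channel erases it.
        if waiting and rng.random() >= beta:
            delays.append(t - waiting.popleft())
    return delays

def tail_fraction(delays, d):
    """Empirical fraction of bits whose delay exceeded d channel uses."""
    return sum(1 for x in delays if x > d) / len(delays)

if __name__ == "__main__":
    delays = simulate_repeat_until_received(beta=0.4, uses_per_bit=2, num_bits=200000)
    for d in (0, 5, 10, 20, 40):
        print(d, tail_fraction(delays, d))
```

Plotting the logarithm of the empirical tail against $d$ gives a rough picture of the exponent, although a Monte Carlo run of this kind cannot resolve very small probabilities.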

The Markov-chain analysis in Section II-A becomes unwieldy at rates that are not simple rational numbers like
$\frac{1}{2}$. In [34], [35], an analysis of this scheme is given by translating the communication problem into a problem of
stabilization of an unstable scalar plant over a noisy feedback link using techniques from [36]. The stabilization
problem can then be studied explicitly in terms of its $\eta$-th moments, which can be understood using certain infinite
sums. The dominant terms in these sums are found using heuristic arguments (rigorous only for $\eta = 2, 3$) and
the convergence of those reveals which $\eta$-moments are finite. This in turn implicitly gives a lower bound to the
reliability function with delay. It turns out that this calculation agrees with the uncertainty-focusing bound. In the
following section, a direct and rigorous proof is given for Theorem 3.3 at all rates.

A BEC with erasure probability $\beta$ is output-symmetric and so the Haroutunian bound and the sphere-packing 
bound are identical. Evaluating the symmetric uncertainty-focusing bound (18) gives the following parametric 
expression: (in units of bits and power of two reliability exponents since the computation is simpler in that base) 

$$E(\eta) = \eta - \log_2(1 + \beta(2^{\eta} - 1)), \qquad R(\eta) = \frac{\eta - \log_2(1 + \beta(2^{\eta} - 1))}{\eta} \tag{28}$$

where $\eta$ ranges from 0 to $\infty$. 

Simple algebraic manipulation allows the parameter $\eta$ to be eliminated and this results in the rate-reliability 
tradeoff of (19). The calculations for this and the simple low-rate bound are in Appendix III-D. 
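
The parametric curve for the erasure channel is easy to trace numerically. The sketch below assumes the base-2 form $E(\eta) = \eta - \log_2(1 + \beta(2^{\eta} - 1))$ with $R(\eta) = E(\eta)/\eta$; the function name is an illustrative choice.

```python
import math

def bec_focusing_point(eta, beta):
    """Return (R, E) in bits for a BEC with erasure probability beta.

    E(eta) = eta - log2(1 + beta * (2**eta - 1)) and R(eta) = E(eta) / eta,
    for a parameter eta > 0.
    """
    expo = eta - math.log2(1.0 + beta * (2.0**eta - 1.0))
    return expo / eta, expo

if __name__ == "__main__":
    beta = 0.4
    for eta in (0.01, 0.5, 1.0, 2.0, 5.0, 20.0):
        rate, expo = bec_focusing_point(eta, beta)
        print(f"eta={eta:6.2f}  R={rate:.4f} bits  E={expo:.4f}")
```

As `eta` tends to 0 the rate tends to the capacity $1 - \beta$ with a vanishing exponent, and as `eta` grows the exponent saturates at $-\log_2 \beta$ while the rate tends to 0.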



5 If the feedback is not instantaneous, then there is no obvious scheme. Asymptotically optimal schemes for such cases are given in [33]. 
B. Direct proof of achievability 

In this section, the asymptotic achievability of the BEC's fixed-delay reliability function (28) is proven directly 
using a technique that parallels the bounding technique used for Theorem 3.2. 

The key idea is to use the first-in-first-out property of the "repeat until received" strategy, treating the system as 
a D/M/1 queue. The only way the i-th bit would not be received by the deadline is if there were too few successes. 
It is easy to see that this could happen if there were zero successes after it enters the system. But it could also 
happen if there were only one success since the previous bit entered the system, and so on. This is captured in the 
following: 

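Lemma 6.1: The probability that bit $i$ is unable to meet deadline $\lceil i/R' \rceil + d$ can be upper-bounded by: 

$$P\big(\hat{B}_i(\lceil i/R' \rceil + d) \neq B_i\big) \le \sum_{k=1}^{i} P\Big(\sum_{t=\lceil k/R' \rceil}^{\lceil i/R' \rceil + d} Z_t \le i - k\Big) \tag{29}$$

where the $\{Z_t\}$ are the iid random variables that are 1 if the $t$-th channel use is successful and 0 if it is erased. 
Proof: See Appendix III-E. 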



[Figure 13 annotations: error events of the smaller-$k$ type have probabilities that die away and are summed; error events of the larger-$k$ type are individually bounded by the dominating one and union bounded collectively; the dominating error event is marked.] 



Fig. 13. Error events beginning with message bits from earlier than k are those whose probabilities are getting exponentially small and are 
all less than the dominating event. The shorter events number linear in d and are all individually smaller than the dominating error event. 
This shows that the dominating event's exponent governs the probability of error as a whole. 



The next idea is to isolate the dominant term in the sum (29) and to bound the whole sum explicitly in terms of 
this. The idea is depicted in Figure 13. The potentially unbounded-length sum (since $i$ is arbitrary) is broken into 
two parts. One part has a finite number of terms and each term is upper-bounded by the dominant term. The other 
part has an unbounded number of terms but that sum is bounded using a convergent geometric series. This is done 
explicitly rather than relying on asymptotic large-deviations theorems so that the resulting constants are available 
to us to calculate plots for finite delays. The details are in Appendix III-F but result in 



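[Equation (30), not recoverable in full from the source text: an upper bound on $P(\hat{B}_i(\lceil i/R' \rceil + d) \neq B_i)$ built from exponentials in the divergences $D(R' + 2\bar{n}^{-1} \| 1-\beta)$ and $D(\lambda^* R' \| 1-\beta)$, one factor of which is the bracketed series $\big[\sum_{l=0}^{\infty} \exp\big(-l(D(R' + 2\bar{n}^{-1} \| 1-\beta) - \epsilon_1)\big)\big]$.] 

where $\lambda^*$ is coming from (27), $\bar{n}$ is proportional to $d$, and $\epsilon_1, \epsilon_2$ are constants that can be made arbitrarily small 
as $d$ gets large. The term in the brackets $[\cdots]$ is a convergent geometric series while the accompanying ratio of divergences approaches 
1 as $d$ and hence $\bar{n}$ gets large. 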

Since $\lambda^*$ does not depend on $d$, just notice that $E_a(R) = \frac{D(\lambda^* R' \| 1-\beta)}{1-\lambda^*}$ for the binary erasure channel to get: 

$$P\big(\hat{B}_i(\lceil i/R' \rceil + d) \neq B_i\big) \le (\gamma + \xi d) \exp(-d(1-\epsilon_3)E_a(R)) \tag{31}$$

for all $d > d_3(\epsilon_3, \beta, R)$ where $\epsilon_3 > 0$ is arbitrary and $\gamma, \xi > 0$ are constants depending on $\epsilon_3, \beta$ and $R$. Since the 
linear term is dominated by the exponential, it is clear that the bound of (18) is asymptotically achievable for the 
BEC with large delays at all rates $R$. For non-binary erasure channels, it is obvious that the same proof holds. 
Furthermore, since the FIFO-based encoder does not need to know what the target delay is, the code is clearly 
delay-universal or anytime in nature. □ 



C. The transmission delay view 

An alternative view of the communication problem over the binary erasure channel is useful when considering 
more general cases. Each bit's delay can be viewed as the sum of a queuing delay (that can be correlated across 
different bits) and its transmission delay $T_j$ which is a geometric$(1-\beta)$ random variable that is iid over different bits 
$j$. The event $\{\sum_{t=\lceil k/R' \rceil}^{d + \lceil i/R' \rceil} Z_t \le i - k\}$ from (29) can alternatively be expressed in this language 
as: $\{\sum_{j=k}^{i} T_j > d + \lceil i/R' \rceil - \lceil k/R' \rceil\}$. This expresses the event that even if any backlog before $k$ is ignored, the 
unlucky transmission delays alone are too much for the bit $i$ to meet its deadline. (29) then becomes 

$$P\big(\hat{B}_i(\lceil i/R' \rceil + d) \neq B_i\big) \le \sum_{k=1}^{i} P\Big(\sum_{j=k}^{i} T_j > d + \lceil i/R' \rceil - \lceil k/R' \rceil\Big). \tag{32}$$



With this interpretation, Theorem 3.3 about the binary erasure channel implies the following result about large 
delays in certain D/G/1 queues: 

Corollary 6.1: Consider a communication system in which point messages arrive deterministically at a steady 
rate of $R'$ messages per unit time, are FIFO queued up until ready to be served, and are then independently 
served using geometric$(1-\beta)$ service times $T_j$. Given any $\epsilon_4 > 0$, there exists a $d_4(\epsilon_4, \beta, R)$ so that for all 
$d > d_4$, the probability that point message $i$ has not completed service by time $d + \lceil i/R' \rceil$ is upper bounded by 
$\exp(-d(1-\epsilon_4)E_a^{bec}(R') \ln 2)$ where $E_a^{bec}$ from (28) is the fixed-delay error exponent for the binary erasure channel 
with erasure probability $\beta$ and rate $R'$ in bits per channel use. This fixed delay exponent is attained universally 
over all sufficiently long delays $d > d_4$. 

Furthermore, this result continues to hold even if the independent service times $T_j$ merely have complementary 
CDFs that are bounded by: $P(T_j > k) \le \beta^k$. The service times do not need to be identically distributed. 

Finally, suppose the point message rate $R' = \frac{1}{m}$ where $m > \tilde{m}$ is a positive integer and the independent service 
times $T_j$ satisfy $P(T_j > \tilde{m} + k) \le \beta^k$. Then the probability that point message $i$ (which arrived at time $im$) has 
not completed service by time $d + im$ is upper bounded by $\exp(-d(1-\epsilon_4)E_a^{bec}(R'') \ln 2)$ where $R'' = (m - \tilde{m})^{-1}$. 
This fixed delay exponent is also attained universally over all sufficiently long delays $d > d_4 + \tilde{m}$. 
Proof: In place of bits, there are messages. The geometric random variables can be interpreted as the interarrival 
times for the Bernoulli process of successful transmissions. The rate $R'$ bits per channel use turns into $R' \ln 2$ nats 
per channel use. Finally, the $\gamma + \xi d$ polynomial from (31) can be absorbed into the exponential by just making 
$d_4$ and $\epsilon_4$ a little bigger than the original $d_3$ and $\epsilon_3$. This establishes the result for independent geometric service 
times. 

For the case of general service times whose complementary CDF is bounded by the geometric's complementary 
CDF, the reason is that the errors all come from large deviations events of the form $\sum_{j=k}^{i} T_j > l$. For each $j$, start 
with an independent continuous uniform $[0, 1]$ random variable $V_j$ and obtain both $T_j$ and $T'_j$ from $V_j$ through the 
inverse of their respective CDFs. This way, each of the $T_j$ can be paired with a geometric $T'_j$ random variable such 
that $\forall \omega, T_j(\omega) \le T'_j(\omega)$ where $\omega$ represents an element from the sample space. Since 

$$\begin{aligned}
\Big\{\omega \,\Big|\, \sum_{j=k}^{i} T_j(\omega) > l\Big\} &= \bigcup_{\vec{l}:\, \sum_n l_n = l} \; \bigcap_{n=k}^{i} \{\omega \mid T_n(\omega) > l_n\} \\
&\subseteq \bigcup_{\vec{l}:\, \sum_n l_n = l} \; \bigcap_{n=k}^{i} \{\omega \mid T'_n(\omega) > l_n\} \\
&= \Big\{\omega \,\Big|\, \sum_{j=k}^{i} T'_j(\omega) > l\Big\},
\end{aligned}$$

it is clear that $P(\sum_{j=k}^{i} T_j(\omega) > l) \le P(\sum_{j=k}^{i} T'_j(\omega) > l)$ and so the same error probability bounds can be 
achieved. 
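
The coupling argument can be sketched in a few lines. The example below assumes service times supported on $\{1, 2, \ldots\}$ and uses a hypothetical truncated distribution for $T_j$ whose complementary CDF is bounded by $\beta^k$; both variables are generated from the same uniform sample through their inverse CDFs, so the pointwise domination $T_j \le T'_j$ can be checked directly.

```python
import random

def sample_from_ccdf(ccdf, v):
    """Inverse-CDF sampling on {1, 2, ...}.

    ccdf(k) must return P(T > k). Returns the smallest k >= 1 such that
    P(T <= k) >= v, for a uniform sample v in [0, 1).
    """
    k = 1
    while 1.0 - ccdf(k) < v:
        k += 1
    return k

def coupled_pair(ccdf_t, beta, v):
    """Generate (T, T') from the same uniform v, with T' geometric(1 - beta)."""
    t = sample_from_ccdf(ccdf_t, v)
    t_geo = sample_from_ccdf(lambda k: beta**k, v)
    return t, t_geo

if __name__ == "__main__":
    beta = 0.4
    # Hypothetical service-time law: geometric tail truncated at 3, so that
    # P(T > k) <= beta**k holds for every k.
    truncated = lambda k: beta**k if k < 3 else 0.0
    rng = random.Random(0)
    pairs = [coupled_pair(truncated, beta, rng.random()) for _ in range(10)]
    print(pairs)
```

Because each pair satisfies $T_j \le T'_j$ sample by sample, any sum of the $T_j$ is dominated by the corresponding sum of geometrics, which is the content of the set inclusion above.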
Finally, consider the case of $R' = \frac{1}{m}$ and independent service times bounded by those of a constant plus 
geometrics. (32) simplifies to 

$$\begin{aligned}
P\big(\hat{M}_i(im + d) \neq M_i\big) &\le \sum_{k=1}^{i} P\Big(\sum_{j=k}^{i} T_j > d + im - km\Big) \qquad (33) \\
&= \sum_{k=1}^{i} P\Big(\sum_{j=k}^{i} T_j > d + (i - k)m\Big) \\
&= \sum_{k=1}^{i} P\Big(\sum_{j=k}^{i} (T_j - \tilde{m}) > d + (i - k)(m - \tilde{m})\Big) \\
&= \sum_{k=1}^{i} P\Big(\sum_{j=k}^{i} (T_j - \tilde{m}) > d + i(m - \tilde{m}) - k(m - \tilde{m})\Big). \qquad (34)
\end{aligned}$$

Notice that in (34), the random variable $\tilde{T}_j = T_j - \tilde{m}$ has a complementary CDF bounded by a geometric and 
corresponds to (32) with a point-message rate of $R'' = (m - \tilde{m})^{-1}$. Thus, the error exponent with delay is at least 
as good as $E_a^{bec}(R'') \ln 2$ for the point messages. □ 



VII. ACHIEVABILITY FOR GENERAL CHANNELS 

The goal of this section is to prove Theorems 3.4 and 3.5. Rather than starting with channels with strictly positive 
zero-error capacity, it is conceptually easier to start with generic DMCs but add a low-rate error-free side channel 
that can be used to carry "control" information. This information is interpreted as a kind of punctuation used to 
make the channel output stream unambiguously understandable to the decoder. The idea is that the rate of this 
error-free control channel is much lower than the message rate that needs to be communicated. This allows the 
result to extend immediately to channels with strictly positive zero-error capacity. For general channels, the control 
channel is synthesized and its own errors must be taken into account. 

A. The scheme for fortified systems with noiseless feedback 

A "fortified" model is an idealization (depicted in Figure 14) that makes an error-free control channel explicit: 
Definition 7.1: Given a DMC $P$ for the forward link, a $\frac{1}{k}$-fortified communication system built around it is one 
in which every $k$-th use of $P$ is supplemented with the ability to transmit a single error-free bit to the receiver. 

[Figure 14 annotations: original forward DMC channel uses; fortification error-free side channel uses $S_1, S_2, S_3, S_4, S_5, S_6$.] 



Fig. 14. Fortification illustrated: the forward noisy channel uses are supplemented with regular low-rate use of an error-free side channel. 

In comparison to the encoders with feedback from Definition 3.2, fortified encoders get to send an additional 
error-free bit $S_t$ at times $t$ that are integer multiples of $k$. The decoders are naturally modified to get causal access 
to the error-free bits as well. 

The idea is to generalize the repeat-until-received strategy used for the erasure channel in Theorem 3.3. A family 
of schemes indexed by three parameters $(n, c, l)$ is described first, and the asymptotic achievability of $E_{a,s}(R)$ is 
shown by taking an appropriate limit over such schemes. 
[Figure 15 annotations: forward DMC channel uses; error-free side channel; one chunk; deny, deny, confirm; prior block list disambiguation; previous block confirmation; unused control bits; list disambiguation.] 



Fig. 15. One block's transmission in the $(n, c, l)$ scheme for a $\frac{1}{k}$-fortified channel. In this case $l = 2$ and $c = 4$, so there are 4 error-free 
bits per chunk. $n$ is not visible at this level since the number of chunks needed for successful transmission is random. Typically fewer than 
$n$ chunks are needed so $n$ could be 5 in this example. 



Call $c > 1$ the chunk length in terms of how many control bits are associated with each chunk, $2^l$ the list length 
(with $l \le c - 1$), and $n > l$ the message block length in units of chunks. The $(n, c, l)$ randomized communication 
scheme (illustrated in Figure 15) is: 

1) The encoder queues up incoming message bits and assembles them into message blocks of size $\frac{nckR}{\ln 2}$ bits. 
One such message block arrives deterministically every $nck$ channel uses. 

2) At every noisy channel use, the encoder sends the channel input corresponding to the next position in an 
infinite-length random codeword associated with the current message block. 

Formally, the codewords are $X_i(j,t)$ where $i \ge 0$ represents the current block number, $t \ge 0$ is the current 
channel-use time, and $0 \le j < \exp(nckR)$ is the value of the current message block. Each $X_i(j,t)$ is drawn 
iid from $\mathcal{X}$ using the $E_0(\eta)$ maximizing distribution $q$. An $\eta$ is chosen such that the desired rate $R < \frac{E_0(\eta,q)}{\eta}$ 
while the target reliability is also $\alpha < E_0(\eta,q)$. 

If there is no message block to send, the encoder just idles by transmitting the next letter in the past message 
block. 

3) If the time is an integer multiple of $ck$, the encoder uses the noiselessly fed-back channel outputs to simulate 
the decoder's attempt to ML-decode the current codeword to within a list of size $2^l$. 

If the true codeword is one of the $2^l$ entries on the list, the encoder sends a 1 (confirm) over the noiseless 
forward link. The encoder places into a control queue $l$ bits representing the true codeword's index within 
the decoder's list. The encoder then removes the current message block from the message queue. 
If the true block is not in the decoder's list, the encoder sends a 0 (deny) over the error-free forward link. 
The 0 can be viewed as a null punctuation mark while the 1 corresponds to a comma delimiting one variable-length 
block from another. When the list-disambiguation information is sent, it can be interpreted as a specific 
type of comma. There are thus just $l + 1$ different kinds of punctuation in the system. 

4) If the time is an integer multiple of $k$ but not an integer multiple of $ck$, then the encoder looks in the control 
queue and transmits one of these bits over the error-free link, removing it from this second queue. If there 
are no control bits waiting, then the error-free link is ignored. 

Since $c > l$, all $l$ of the control bits will be communicated within one chunk. 

5) At the decoder, the encoder's message queue length is known perfectly since it can only change by the 
deterministic arrival of message blocks or when an error-free confirm or deny bit has been sent over the 
noise-free link. Thus the decoder can correctly parse the received channel uses and always knows which 
message block a given channel output $Y_t$ or fortification symbol $S_t$ corresponds to. 

6) If the time is an integer multiple of $ck$ and the decoder receives a 1 noiselessly, then it decodes what it has 
seen to a list of the top $2^l$ possibilities for this message block. It uses the next $l$ error-free bits to disambiguate 
this list and commits to the result as its estimate for the message block. 
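
The timing rules in steps 3) and 4) can be summarized as a small classification of channel-use times. The sketch below is only an illustration of that schedule; the function name and the role labels are hypothetical, and time is assumed to be counted in channel uses starting from 1.

```python
from collections import Counter

def error_free_slot_role(t, k, c):
    """Classify what the error-free side channel carries at channel use t.

    Every k-th channel use has an error-free bit. Times that are multiples of
    c*k carry the confirm/deny decision; the other multiples of k carry a
    queued control (list-disambiguation) bit if one is waiting.
    """
    if t % k != 0:
        return "none"
    if t % (c * k) == 0:
        return "confirm_or_deny"
    return "control_bit"

if __name__ == "__main__":
    k, c = 3, 4  # one chunk spans c * k = 12 channel uses
    roles = Counter(error_free_slot_role(t, k, c) for t in range(1, c * k + 1))
    print(roles)
```

Each chunk therefore has one confirm/deny slot and $c - 1$ slots for control bits, which is why the list-index bits fit within a single chunk whenever $l \le c - 1$.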
B. Analysis of end-to-end delay and probability of error 

It is clear that this hybrid-ARQ scheme does not commit any errors at the decoder. Some blocks just take longer 
to make it across than others do. Furthermore, notice that the delay experienced by any message bit can be divided 
into four parts: 

1) Assembly delay: How long it takes before the rest of the message block has arrived at the encoder. This is 
bounded by a constant $nck$ channel uses. 

2) Queuing delay: How long the message block must wait before it begins to be transmitted. 

3) Transmission delay: How many channel uses it takes before the codeword can be correctly decoded to within 
a list of $2^l$. This is a random quantity $T_j$ that must be an integer multiple of $ck$ channel uses. The $T_j$ are iid 
since the channel is memoryless and the random codebooks are also iid. 

4) Termination delay: How long the decoder must wait before the block is disambiguated by the error-free 
control signals. This is bounded by a constant $lk$ channel uses. 

Since the assembly and termination delays are constants that do not depend on the target end-to-end delay, they 
can be ignored and the focus kept on the queuing and transmission delays. This is because our interest is in the 
fixed-delay behavior for asymptotically large delays much longer than $nck$. Since the transmission delays are iid, 
the approach is to apply Corollary 6.1 and this requires a bound in terms of a constant plus a geometric. 

Lemma 7.1: The $(n, c, l)$ transmission scheme using input-distribution $q$ at rate $R$ for a $\frac{1}{k}$-fortified communication 
system over a DMC has iid transmission times $T_j$ satisfying 

$$P\big(T_j - \lceil t(\rho,R,n,q) \rceil ck > t\,ck\big) \le [\exp(-ckE_0(\rho, q))]^{t} \tag{35}$$

for all $0 \le \rho \le 2^l$ and positive integer $t$ where $t(\rho, R, n, q) = \frac{R}{C(\rho,q)} n$ and $C(\rho, q) = \frac{E_0(\rho,q)}{\rho}$. 
Proof: See Appendix III-G. 

[Figure 16 annotations: $n = 16$ total chunks in a block; potential slack chunks; true minimum number of chunks required based on Shannon capacity; "minimum" number of chunks $t(\rho)$ required based on target reliability $E_0(\rho)$; slack worth at least $\frac{E_0(\rho) - \rho R}{E_0(\rho)}$.] 



Fig. 16. Because the message rate is less than capacity, there is some "slack" in the system. The amount of slack varies with the target 
reliability $E_0(\rho)$ and goes to zero when $R = \frac{E_0(\rho)}{\rho}$. The "essential" part of the block is denoted $t(\rho)$ and its complement is the slack. 



Lemma 7.1 is illustrated in Figure 16 and then the application of Corollary 6.1 is illustrated in Figures 17 and 
18. These illustrate the achievability of the fixed-delay exponent 0.44 at a rate of 0.37 nats. The gap between 0.44 
and 0.37 on the rate axis in Figure 17 depicts the fraction of "slack" channel uses that are available to communicate 
a message block with reliability 0.44 while still draining the queue faster than it is being filled. The block length 
n must be long enough so that the slack represents at least a few channel uses. As the block length n becomes 
longer, it is possible to move up to the reliability limit illustrated by the inverse concatenation construction. 

Consider time in $ck$ units. Let $R' = \frac{1}{n}$ be the rate at which message blocks are generated in terms of blocks 
generated per $ck$ channel uses. $R'' = \frac{1}{n - \lceil t(\rho,R,n,q) \rceil}$ is the rate at which we evaluate the BEC's fixed-delay reliability 
in the application of Corollary 6.1. The effective "erasure probability" is $\beta = \exp(-ckE_0(\rho, q))$. 

Recall that $R < \frac{E_0(\rho)}{\rho}$ where the $q$ distribution is chosen as the $E_0(\rho)$ achieving distribution. The quantity 
$n - \lceil t(\rho, R, n, q) \rceil$ has a special significance since it captures the amount of slack in the system when viewed with 
parameter $\rho$. This slack term is positive for large enough $n > \frac{C(\rho,q)}{C(\rho,q) - R}$ since 
Fig. 17. Why at least a 0.44 fixed-delay exponent is achievable at rate 0.37 nats per channel use. The thick curve is the sphere-packing 
bound and the thin curve on top is the uncertainty-focusing bound. The thick tangent represents using a list size of 1. The gap between 
0.44 and 0.37 on the rate axis depicts the fraction of "slack" channel uses that are available. The thin tangent is the one used in the 
inverse-concatenation construction for the bound at rate 0.37 nats per channel use. 



$$n - \lceil t(\rho,R,n,q) \rceil > \Big(\frac{C(\rho,q) - R}{C(\rho,q)}\Big) n - 1. \tag{36}$$

Thus 

$$R'' = \frac{1}{n - \lceil t(\rho,R,n,q) \rceil} < \Big(\Big(\frac{C(\rho,q) - R}{C(\rho,q)}\Big) n - 1\Big)^{-1}.$$

Notice that $R''$ can be made as small as desired by choosing $n$ large while $\beta$ can be made extremely small by 
choosing $c$ large. Applying Theorem 3.3 tells us to set 

$$1 + 2r = \frac{C(\rho,q) - R}{C(\rho,q)}\, n - 1$$

$$n(\rho,c,k,l,r) = \frac{C(\rho,q)}{C(\rho,q) - R}\,(2 + 2r) \tag{37}$$

in order to get to within $(2 \ln 2) \exp(-rckE_0(\rho, q))$ of the exponent $ckE_0(\rho,q)$ in terms of delays measured in 
$ck$ time units, or to within $\frac{2 \ln 2}{ck} \exp(-rckE_0(\rho, q))$ of the exponent of $E_0(\rho,q)$ in terms of delays measured in 
channel uses. 
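
The choice of block length in (37) can be checked with a short calculation. The sketch below rounds $n$ up to an integer (an assumption, since a block must contain a whole number of chunks), computes the essential part $\lceil t \rceil = \lceil Rn/C(\rho,q) \rceil$, and reports the slack $n - \lceil t \rceil$. The function name and the numerical values are illustrative only.

```python
import math

def block_length_and_slack(c_rho, rate, r):
    """Pick n from (37), rounded up, and return (n, slack).

    c_rho stands for C(rho, q) = E_0(rho, q) / rho, rate is R < C(rho, q),
    and r is the parameter from the low-rate bound. The slack is
    n - ceil(R * n / C(rho, q)), measured in chunks.
    """
    if not rate < c_rho:
        raise ValueError("need R < C(rho, q) for any slack to exist")
    n = math.ceil(c_rho / (c_rho - rate) * (2 + 2 * r))
    essential = math.ceil(rate / c_rho * n)
    return n, n - essential

if __name__ == "__main__":
    n, slack = block_length_and_slack(c_rho=0.5, rate=0.37, r=3)
    print(n, slack, 1.0 / slack)  # block length, slack chunks, R''
```

With these illustrative numbers the slack exceeds $1 + 2r$, so the point-message rate $R''$ falls below $\frac{1}{1+2r}$, which is the regime the low-rate bound is applied to.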

Putting it all together, for any small $\Delta > 0$, and $\rho > 0$ such that $R < \frac{E_0(\rho)}{\rho}$, a delay-exponent of $E_0(\rho) - (3 \ln 2)\Delta$ 
is clearly achievable by setting $l = \max(0, \lceil \log_2 \rho \rceil)$, choosing chunk size 

$$c = \max\Big(l + 1, \Big\lceil \frac{\ln 16}{kE_0(\rho)} \Big\rceil\Big) \tag{38}$$

and then choosing $r$ big enough using 

$$r > \max\Big(0, -\frac{\ln(\Delta ck)}{ckE_0(\rho)}\Big). \tag{39}$$
Fig. 18. At the top, the original timeline depicts the arrival of message blocks and the target delay. The "essential" component $t(\rho)$ is 
also shown. In the middle, a particular realization is shown illustrating how an error can happen when the service times $T_j$ of the blocks 
become too large. At the bottom, the essential components of the service times are removed and the performance of the system is shown to 
be bounded by that of a queue with iid geometric service times serving the low-rate deterministic arrival of point-messages corresponding 
to the message blocks. 



With $c$ and $r$ defined, $n$ can be obtained from (37). 

Notice that $k$ is arbitrary here and can thus be made as large as desired. This corresponds to the fact that the 
amount of "punctuation" information can be made as small as desired, assuming that the target end-to-end delay 
is large enough.$^6$ 

Each $(n, c, l)$ code is also delay universal since it is not designed with a maximum $d$ in mind. The longer the 
decoder is willing to wait, the lower the probability of error becomes. This property is inherited from the 
repeat-until-success code for the erasure channel through Corollary 6.1. □ 



6 The target end-to-end delay must at least be large enough to absorb the roughly 2nck channel uses corresponding to the sum of assembly 
delay and essential service time for the message block. It is beyond that point that the delay exponent analysis here kicks in. 
C. Channels with strictly positive zero-error capacity 

The above communication scheme is easily adapted to channels with strictly positive zero-error capacity by just 
using a zero-error code to carry the punctuation information. There is no $k$. Instead, let $\theta$ be the block length 
required to realize feedback zero-error transmission of at least $l + 1$ bits. As illustrated in Figure 19, terminate 
each chunk with a block-length-$\theta$ feedback zero-error code and use it to transmit the punctuation information. If 
the chunk size is $c$ channel uses, then it is as though we are operating with only a fraction $(1 - \frac{\theta}{c})$ of the channel 
uses. This effectively increases the rate to $R/(1 - \frac{\theta}{c})$ and reduces the achieved delay exponent to $\alpha(1 - \frac{\theta}{c})$ as well. 
This overhead becomes negligible by making the chunk size $c$ large, giving us the desired result. 

[Figure 19 annotations: DMC channel uses; $(l + 1)$-bit per chunk code for punctuation: zero-error feedback block codes or infinite-constraint convolutional code.] 

Fig. 19. One block's transmission in the channel code with time-sharing between the message code and punctuation code. Each chunk is 
terminated with a $\theta$-length segment to convey punctuation information. If the channel has zero-error capacity, then a zero-error block code 
can be used to tell the decoder whether to move on to the next block of the message or not. If it is to move on, the chunk terminator 
also tells which of the $2^l$ most likely messages was conveyed by this particular block. By making the chunk length $c$ long, the overhead 
of the control messages becomes asymptotically negligible since $\theta$ remains fixed if there is zero-error capacity. When there is no zero-error 
capacity, then $\theta$ stays proportional to $c$ and an infinite-constraint-length random convolutional code is used to carry punctuation information. 



D. Delayed feedback 

Let $\phi$ be the delay in the noiseless feedback. So the encoders now know only $Y^{t-\phi}$ in addition to the message 
bits. Everything continues to work because the chunks $c$ can be made much longer than $\phi$. The last $\phi - 1$ channel 
uses in a chunk can then be discarded without any significant overhead. 

Thus, Theorem 3.4 holds for any communication system in the asymptotic limit of large end-to-end delays even 
if there are small round-trip delays in the feedback. All that is required is some way to provide infrequent, but 
unmistakable, punctuation information from the encoder to the decoder. □ 



E. Channels without zero-error capacity: paying for punctuation 

All that remains is to prove Theorem 3.5. When the channel has no zero-error capacity, then it is still possible 
to follow the Section VII-C approach of allocating $\theta$ channel uses per chunk to carry punctuation information. The 
channel uses are partitioned as before into two streams assigned to two sub-encoders. The first is exactly as it was 
in Section VII-C and carries the message itself using a variable-length channel code with the dynamic length 
chosen to ensure correct list-decoding. This first encoder generates punctuation messages at the end of every chunk 
and these are the input to the second encoder. The second encoder's role is to convey this punctuation information 
consisting of $l + 1$ bits for every chunk. 

Instead of using a zero-error code, the second encoder is implemented using an infinite-constraint-length 
time-varying random convolutional code. The trick of Appendix II can be used to reduce the expected computational 
burden for encoding/decoding by using feedback, but essentially this sub-code operates without feedback. 

The decoder also runs with two subsystems. One subsystem is responsible for decoding the punctuation stream. 
This can be implemented using either an ML decoder or a sequential decoder from [23]. Either way, it is responsible 
for giving its current best estimate for all punctuation so far. By the properties of random infinite-constraint- 
length convolutional codes, this attains the random-coding error exponent with respect to delay for every piece of 
punctuation in the stream. The earlier punctuation marks are almost certainly decoded correctly while more recent 
punctuation marks are more likely to be subject to error. 

This current estimate for all the punctuation so far is then used by the subsystem responsible for decoding the 
message bits themselves. The decoded punctuation is used to tentatively parse the channel outputs into variable- 
length blocks and then tentatively decode those blocks under the assumption that the punctuation is correct. Any 
bits that have reached their deadlines are then emitted. Although the decisions for those bits are now committed 
from the destination's point of view, this does not prevent the system from re-parsing them in the future when 
considering estimates for other bits. 

1) Analysis: An error can occur at the decoder in two different ways. As before, the message-carrying stream 
could be delayed due to channel atypicality in its own channel slots. The new source of errors is that the punctuation 
stream could also become corrupted through atypicality in these other channel slots. As a result, the punctuation 
overhead $\theta$ must be kept proportional to the chunk length $c$ to avoid having punctuation errors cause too many 
decoding errors. 

Set $\theta = \psi c$ for a constant $\psi$ to be optimized. The rate of the punctuation information is $\frac{l+1}{\theta} = \frac{l+1}{\psi c}$ and goes 
to zero as $c \to \infty$. Since the random-coding error exponent at rate 0 approaches $E_0(1)$, this is the relevant error 
exponent for the second stream relative to the channel uses that it gets. But there are only $\psi$ punctuation-code 
channel uses per second and so the delay-exponent for the punctuation stream is actually $\psi E_0(1)$ with respect to 
true delay. 

Meanwhile, the chunk size in the message stream is $c' = c(1 - \psi)$. The effective rate of the message stream is 
thereby increased to $\frac{R}{1-\psi}$. Assuming that the punctuation information is correct, the fixed-delay error-exponent is 
as close as we would like to $E_{a,s}(\frac{R}{1-\psi})$ with respect to the delay in terms of message-code channel uses. But there 
are only $(1 - \psi)$ message-code channel uses per second and so the delay exponent approaches $(1 - \psi)E_{a,s}(\frac{R}{1-\psi})$ 
with respect to true end-to-end delay. 

Consider a large fixed delay $d$. It can be written as $d = d_f + d_m$ in $d$ different ways. Let $d_f$ be the part of the 
end-to-end delay that is burned by errors in the punctuation stream. That is, with probability exponentially small 
in $d_f$, this suffix of time has possibly incorrect punctuation information and so cannot be trusted to be interpreted 
correctly. If the bit did not make it out correctly in the $d_m$ time steps (corresponding to $(1 - \psi)d_m$ channel uses 
for the message-code) where the punctuation is correct, we assume that it will not come out correctly. 

Since the channel uses are disjoint between the punctuation and message streams, the two error events are 
independent. The probability of an error with delay $d$ can thus be union-bounded as 

$$\begin{aligned}
P(\hat{B}_i(d) \neq B_i) &\le \sum_{d_m=1}^{d} P(\text{message error with delay } d_m)\, P(\text{punctuation error with delay } d - d_m) \\
&\le \sum_{d_m=1}^{d} K_1 \exp\Big(-d_m\Big((1 - \psi)E_{a,s}\Big(\frac{R}{1-\psi}\Big) - \epsilon_1\Big)\Big)\, K_2 \exp\big(-(d - d_m)(\psi E_0(1) - \epsilon_2)\big) \\
&\le d K_1 K_2 \exp\Big(-d\Big(\min\Big\{\psi E_0(1), (1 - \psi)E_{a,s}\Big(\frac{R}{1-\psi}\Big)\Big\} - \epsilon_1 - \epsilon_2\Big)\Big)
\end{aligned}$$

where $\epsilon_1, \epsilon_2$ are arbitrarily tiny constants and $K_1, K_2$ are large constants that together capture the nonasymptotic 
terms in the earlier analysis. 

Since the focus here is on the asymptotic error exponent with delay, the polynomial and $\epsilon$ terms can be ignored 
and an achievable exponent is found by choosing $\psi$ so that the two exponents are balanced: 

$$E' = \psi E_0(1) = (1 - \psi)E_{a,s}\Big(\frac{R}{1-\psi}\Big).$$
Evaluating the parametric forms (18) using $\eta = \rho$ for $E_{a,s}$, we get a pair of equations 

$$\psi E_0(1) = (1 - \psi)E_0(\rho), \tag{40}$$

$$\frac{E_0(\rho)}{\rho} = \frac{R}{1 - \psi}. \tag{41}$$

The first thing to notice is that simple substitution gives 

$$R = \frac{(1 - \psi)E_0(\rho)}{\rho} = \frac{\psi E_0(1)}{\rho} = \frac{E'}{\rho}.$$

Solving for $\psi$ shows (after a little algebra) that 

$$\psi = \frac{E_0(\rho)}{E_0(1) + E_0(\rho)}. \tag{42}$$

This way $1 - \psi = \frac{E_0(1)}{E_0(1) + E_0(\rho)}$ and the first equation is clearly true. Similarly $\frac{1}{1-\psi} = 1 + \frac{E_0(\rho)}{E_0(1)}$ and $\frac{\psi}{1-\psi} = \frac{E_0(\rho)}{E_0(1)}$ 
and thus the second equation is also true. Evaluating, 

$$E' = \psi E_0(1) = \frac{E_0(\rho)E_0(1)}{E_0(1) + E_0(\rho)} = \Big(\frac{1}{E_0(\rho)} + \frac{1}{E_0(1)}\Big)^{-1}.$$

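Simple (but mildly tedious) Taylor series expansion around the $\rho = 0$ point gives $E'(\rho) = 0 + C\rho + \frac{1}{2}\big(\frac{\partial^2 E_0(0)}{\partial \rho^2} - \frac{2C^2}{E_0(1)}\big)\rho^2 + o(\rho^2)$ 
and thus $R(\rho) = C - \big(\frac{C^2}{E_0(1)} - \frac{1}{2}\frac{\partial^2 E_0(0)}{\partial \rho^2}\big)\rho + o(\rho)$. Taking the ratio of the first order terms gives 
the desired slope in the vicinity of the $(C, 0)$ point. The fact that this slope is strictly negative is clear from the 
fact that $\frac{\partial^2 E_0(0)}{\partial \rho^2} \le 0$. □ 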
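
The balancing of the two exponents can be evaluated numerically. The sketch below does so for a binary symmetric channel, assuming the standard Gallager function $E_0(\rho)$ with uniform inputs (in nats); it computes the split $\psi$ that equalizes $\psi E_0(1)$ and $(1-\psi)E_0(\rho)$, the resulting exponent $E'$, and the rate $R = E'/\rho$. The function names are illustrative.

```python
import math

def e0_bsc(rho, p):
    """Gallager function E_0(rho) in nats for a BSC(p) with uniform inputs."""
    s = 1.0 / (1.0 + rho)
    return rho * math.log(2) - (1.0 + rho) * math.log(p**s + (1.0 - p)**s)

def balanced_split(rho, p):
    """Return (psi, R, E_prime) for the punctuation/message time-sharing.

    psi is the fraction of channel uses given to punctuation so that
    psi * E_0(1) equals (1 - psi) * E_0(rho); E_prime is that common value
    and R = E_prime / rho is the corresponding message rate.
    """
    e_one = e0_bsc(1.0, p)
    e_rho = e0_bsc(rho, p)
    psi = e_rho / (e_one + e_rho)
    e_prime = psi * e_one
    return psi, e_prime / rho, e_prime

if __name__ == "__main__":
    for rho in (0.05, 0.25, 0.5, 1.0, 2.0):
        psi, rate, e_prime = balanced_split(rho, p=0.02)
        print(f"rho={rho:4.2f}  psi={psi:.4f}  R={rate:.4f}  E'={e_prime:.4f}")
```

As `rho` shrinks, `psi` tends to 0, the rate approaches capacity, and the exponent vanishes, so the curve leaves the $(C, 0)$ point with a finite negative slope.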

2) Computation: As in the rate-$\frac{1}{2}$ erasure case discussed in Section II-A, the computational burden for the $(n, c, l)$ 
schemes is a constant that depends only on the particular scheme (and hence indirectly on the target rate-reliability 
pair) and not on the target end-to-end delay. As described, the complexity is exponential in the block length $nc$ 
since both the encoder and decoder must do list decoding among the codewords. The computational burden of the 
punctuation code is light since by Appendix II it is like running a sequential decoder for a very-low-rate convolutional 
code. 



F. More examples 

Rather than considering an example using a DMC with strictly positive zero-error capacity, it is more instructive 
to consider a BSC with a fortification side-channel of rate ^ bits per channel use. The capacity of the BSC 
with crossover probability 0.02 increases to 0.61 nats with such fortification and the Burnashev bound becomes 
infinite. Figure 20 shows the effect of zero-error capacity on the sphere-packing and uncertainty-focusing bounds. 
At high rates, the fortified uncertainty-focusing bound looks like it has just been shifted in rate by 0.01 nats, just 
like the fortified sphere-packing bound. However, because of the flatness of the classical sphere-packing bound 
at high rates, the sphere-packing bound visually appears unchanged by fortification on a plot. At very low rates, 
the two behave differently. The fortified uncertainty-focusing bound tends smoothly to infinity at 0.01 nats while 
the fortified sphere-packing bound jumps abruptly to infinity, reflecting the typical behavior of the error exponent 
curves for channels with strictly positive zero-error capacity. 
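
The capacity figure quoted above can be checked with a few lines of arithmetic. The sketch below assumes the side-channel carries 1/50 bit per channel use (the value that reproduces the quoted 0.61 nats and matches the 50 BSC uses per control bit mentioned later); the function names are illustrative only.

```python
import math

def binary_entropy_bits(p):
    """Binary entropy h(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fortified_bsc_capacity_nats(crossover, side_rate_bits):
    """Capacity in nats of a BSC plus a noiseless side-channel.

    The side-channel rate (bits per channel use) simply adds to the
    BSC capacity; side_rate_bits = 1/50 is an assumption of this sketch.
    """
    bsc_bits = 1.0 - binary_entropy_bits(crossover)
    return (bsc_bits + side_rate_bits) * math.log(2)

plain = fortified_bsc_capacity_nats(0.02, 0.0)
fortified = fortified_bsc_capacity_nats(0.02, 1 / 50)
print(f"unfortified: {plain:.3f} nats, fortified: {fortified:.3f} nats")
print(f"shift in rate: {fortified - plain:.4f} nats")
```

The shift equals (ln 2)/50, roughly 0.014 nats, which the text rounds to 0.01 nats.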

Looking at a deeper level of detail, Figure 21 illustrates the time-nature of the dominant error events at different
rates. The question is for how long the channel must behave atypically for a bit to miss its deadline. In fixed-block-length
coding, the usual source of errors is slightly atypical behavior across the entire block. As shown in
Section IV, when feedback is not available, the usual errors mainly involve the channel behaving atypically after
the bit in question arrived at the encoder.

By contrast, in the fixed-delay context with feedback, the dominant error events involve more and more of the
past as the rates get large. This means that the typical way for a bit to miss its deadline is for the channel to have
been behaving atypically for some time before the bit even arrived at the encoder, and for this atypical behavior
to continue until the deadline. At intermediate rates, the future behavior (after the bit has arrived at the encoder)
becomes more important since it is more likely for the channel to become very bad for a shorter period.

Fig. 20. The sphere-packing and uncertainty-focusing bounds, with and without a noiseless side-channel of rate 1/50, for a BSC with crossover
probability 0.02. The horizontal axis is rate (in nats). The lower curves are the sphere-packing bounds and the upper curves are the uncertainty-focusing bounds. The thin lines
represent the fortified cases with the added noiseless side-channel.

At very low rates, the fortified and unfortified systems exhibit qualitatively different behavior. For unfortified 
systems, the dominant error events soon involve essentially only the future. The dominant event approaches the 
channel going into complete "outage" (e.g. the channel flipping half the inputs of a BSC) after the bit arrives at 
the encoder. For systems with positive zero-error capacity, such a complete outage is not possible as the message 
bits can always dribble across the error-free part. For an error to occur, it is essential to build up a large enough 
backlog in the queue and thus the past behavior starts to become dominant again. The curves diverge for the same 
rates at which the fortified case's uncertainty-focusing bound is much better than the unfortified case. 

Figure 22 shows the fixed-delay reliabilities achieved by the (n, c, l) schemes⁷ of Section VII-A for the specific
cases of (10, 3, 2), (20, 4, 3), (50, 8, 6). These are delay-universal since they hold for all sufficiently long delays.
Increasing l increases the list size and helps the low-rate performance, while large block lengths n are needed to
perform well at higher rates. It is interesting to see how the (10, 3, 2) scheme is already spectacularly better
than the feedback-free case for all low to moderate rates. In this case, there are 10 * 3 * 50 = 1500 BSC uses and
only 10 * 3 = 30 error-free control bits corresponding to a typical message block.

VIII. Conclusions 

This paper has shown that fixed-block-length and fixed-delay systems behave very differently when feedback is 
allowed. While fixed-block-length systems do not usually gain substantially in reliability with noiseless feedback, 
fixed-delay systems can achieve very substantial gains for any generic DMC. The uncertainty-focusing bound 
complements the classical sphere-packing bound and gives limits to what is possible. Furthermore, these limits 
can be approached in a delay-universal fashion for erasure channels and any channel with positive feedback zero- 
error capacity if the encoders have access to noiseless channel output feedback, even if that feedback is slightly 
delayed. The computational requirements in doing so do not scale with the desired probability of error and only 
depend on the target rate and delay exponent. The details of this work establish a connection between queuing
and communication over noisy channels with feedback. For the constructions given here, the end-to-end delay is
asymptotically dominated by time spent waiting in a queue.

7 The schemes plotted here have been slightly modified to use the noiseless side-channel to carry codeword information whenever it is
not needed to carry punctuation information. This more accurately reflects the typical behavior of channels with strictly positive zero-error
capacity.

Fig. 21. The dominant error events illustrated by plotting the ratio of future to past in dB scale. The horizontal axis is rate and the vertical
axis is $10 \log_{10} \frac{1-\lambda^*}{\lambda^*}$ where $\lambda^*$ is from (27). The thicker red curve represents the unfortified channel while the thin black curve is the
1/50-fortified system.

Given that complete noiseless feedback now has unambiguously clear value for reliable communication, it is 
important for the community to explore the required quality of feedback. This paper only shows that slightly 
delayed feedback can be tolerated. The case of noisy or rate-constrained feedback in the fixed-delay context is 
almost entirely open (see [33] for the case of erasure channels on both the forward and feedback links). In addition, 
both the upper and lower bounds here only cover the case of a single message stream. The multistream rate/reliability 
region is still unknown even for the BEC case [37]. 

Stepping back, these results are also interesting because they show how feedback changes the qualitative nature 
of the dominant error events. Without feedback, errors are dominated by future channel behavior, but when feedback 
is available, the dominant event involves a mixture of the past and future. When the rate is low, the future tends to 
be more important but when the rate is high, the past starts to dominate. This brings to mind Shannon's intriguing 
comment at the close of [38]: 

[The duality between source and channel coding] can be pursued further and is related to a duality 
between past and future and the notions of control and knowledge. Thus we may have knowledge of the 
past and cannot control it; we may control the future but have no knowledge of it. 

In [39], we explore the source-coding analogs of the results given here. In particular, feedback is found to be 
irrelevant in point-to-point lossless source coding and the dominant error events involve only the past! That makes 
precise the duality hinted at by Shannon. 

Finally, in [40], the techniques developed here are extended to lower-bound the complexity of decoding based on 
iterative message-passing for general codes. The linear concept of time here is generalized to the message-passing 
graph. The role of delay is thus played by the decoding neighborhood within the graph and the corresponding 
bounds reveal the complexity cost of approaching capacity with such decoding algorithms. 
Fig. 22. The fixed-delay error exponents of different schemes for a 1/50-fortified BSC with crossover probability 0.02 used with noiseless
feedback. The lowest curve is the sphere-packing bound limiting feedback-free performance. The three new curves represent what is attained
by the (10, 3, 2), (20, 4, 3), (50, 8, 6) schemes described in Section VII-A and vary by block length, granularity, and the size of the lists used
for list decoding. The uncertainty-focusing bound with and without fortification is plotted for reference.

influenced this work in important ways, and the Berkeley students in the Fall 2004 advanced information theory
course who forced me to simplify the presentation considerably. The anonymous reviewers are also thanked for
their very helpful comments.

Appendix I 

Feedback, convolutional codes, and complexity 

The encoder is allowed to "look over the shoulder" of the decoder and have access to noiseless feedback of the
channel outputs. This appendix gives⁸ the convolutional parallel to the Burnashev problem of variable-block-length
codes. For ease of exposition, suppose the channel is binary input and that the uniform distribution is an optimal
input distribution. If another input distribution is desired, mappings in the style of Figure 6.2.1 of [8] can be used
to approximate the desired channel input distribution. Use $R' = R/\ln 2$ to refer to the input rate in bits per channel use
rather than nats per channel use. Apply the "encode the error signals" advice of [31] to get the following simple
construction of a random code:

• Start with an infinite-constraint-length random time-varying convolutional code. The $j$-th channel input
$X_j = \sum_k H_k(j) B_k \bmod 2$ is generated by correlating the input bits $B_1^{jR'}$ with a random binary string $H_1^{jR'}(j)$.

• Use the noiseless feedback to run a sequential decoder at the encoder. This gives the encoder access to
$\hat{B}_1^{(j-1)R'}(j-1)$: the tentative estimates of the past input bits based on the channel outputs so far. Set
$\hat{B}_{jR'}(j-1) = 0$ since there is no estimate for the new bit, and then compute $\tilde{B}_k(j) = B_k + \hat{B}_k(j-1) \bmod 2$
to represent the current error sequence. Since the probability of bit error is exponentially decreasing in delay
[23], only a small number of the $\tilde{B}_k(j)$ are nonzero, and furthermore, these are all around the more recent
bits. The expected number of nonzero error bits is therefore upper bounded by some constant.

• Run the infinite-constraint-length convolutional code using the error sequence rather than the input bits:
$\tilde{X}_j = \sum_k H_k(j) \tilde{B}_k(j) \bmod 2 = X_j + [\sum_k H_k(j) \hat{B}_k(j-1)] \bmod 2$. Input the resulting $\tilde{X}_j$ into the channel.
Since the additional term $[\cdots]$ is entirely known at the receiver and modulo 2 addition is invertible, this
feedback code has exactly the same distance properties as the original code without feedback. Furthermore,
since there are only a finite random number of nonzero error bits and the encoder knows where these are, the
encoding complexity is a random variable with finite expectation.
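
The three steps above can be sketched for a single channel use as follows. This is a minimal illustration, not the construction's actual implementation: the decoder estimates are supplied by hand rather than by a real sequential decoder, and the names `mod2_correlate` and `encode_error_signal` are hypothetical.

```python
import random

def mod2_correlate(h_row, bits):
    """Mod-2 inner product of one random code row with a bit sequence."""
    return sum(h & b for h, b in zip(h_row, bits)) % 2

def encode_error_signal(bits, estimates, h_row):
    """One channel use of the 'encode the error signals' construction.

    bits      : all input bits available so far
    estimates : the decoder's tentative estimates of the earlier bits
                (learned by the encoder through noiseless feedback)
    h_row     : the random binary string used at this time step
    """
    # Bits with no estimate yet (the newest ones) get the estimate 0.
    padded = list(estimates) + [0] * (len(bits) - len(estimates))
    # The error sequence is the mod-2 sum of true bits and estimates.
    error_seq = [(b + e) % 2 for b, e in zip(bits, padded)]
    x_plain = mod2_correlate(h_row, bits)        # feedback-free input
    x_sent = mod2_correlate(h_row, error_seq)    # input actually sent
    known_term = mod2_correlate(h_row, padded)   # computable at receiver
    return x_plain, x_sent, known_term, error_seq

rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(20)]
h_row = [rng.randint(0, 1) for _ in range(20)]
estimates = bits[:19]      # no estimate yet for the newest bit
estimates[17] ^= 1         # hypothetical: one recent bit is still wrong

x_plain, x_sent, known_term, error_seq = encode_error_signal(
    bits, estimates, h_row)
recovered = (x_sent + known_term) % 2
print("receiver recovers the feedback-free input:", recovered == x_plain)
print("nonzero error bits:", sum(error_seq))
```

Because the receiver can add back the known term, the sent symbol carries the same information as the feedback-free symbol, while the encoder only has to touch the few nonzero positions of the error sequence.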

8 The scheme we describe in this subsection is too obvious to be original to us, but we are unaware of who might have come up with it
earlier.

If a block code is desired, then pick an arbitrary length d to terminate a block, and choose an overall block
length n so that d is insignificant in comparison.

The expected per-channel-input constraint length used by the code is a finite constant that only depends on the
rate, while the overall probability of block error dies exponentially with the terminator length d. Consequently, the
expected-constraint-length error exponent for variable-constraint-length convolutional codes is infinite with noiseless
feedback. If we also count the expected number of computations required to run the encoder's copy of the decoder,
then this result holds for all rates strictly below⁹ the computational cutoff rate $E_0(1)$. Even though noiseless
feedback is used by the encoder to generate the channel inputs, the decoding is "sequential" in the sense of Jacobs
and Berlekamp [41] and suffers from the resulting computational limitation of having a search-effort distribution
with certain unbounded moments.

At rates above $E_0(1)$, the same flavor of result can be preserved in principle by using the concatenated-coding
transformations of Pinsker [42] (as well as others described more recently by Arikan [43]) to bring the computational
cutoff rate $E_0(1)$ as close to C as desired. Thus, the expected-computation error exponent for convolutional-style
codes with noiseless output feedback can be made essentially infinite at all rates below capacity. The expected
complexity is a constant that depends only on the desired rate, not on the target probability of error.

Appendix II 
Extended Proofs 

A. Lemma 4.1



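\begin{aligned}
n(R' - \delta_1') &= H(B_1^{n(R'-\delta_1')}) \\
&\le_{(a)} I(B_1^{n(R'-\delta_1')};\, \tilde{B}_1^{n(R'-\delta_1')}, Y_1^n) \\
&= H(Y_1^n) + H(\tilde{B}_1^{n(R'-\delta_1')} \mid Y_1^n) - H(Y_1^n \mid B_1^{n(R'-\delta_1')}) - H(\tilde{B}_1^{n(R'-\delta_1')} \mid Y_1^n, B_1^{n(R'-\delta_1')}) \\
&=_{(b)} H(Y_1^n) + H(\tilde{B}_1^{n(R'-\delta_1')} \mid Y_1^n) - H(Y_1^n \mid B_1^{n(R'-\delta_1')}) \\
&= H(\tilde{B}_1^{n(R'-\delta_1')} \mid Y_1^n) + I(Y_1^n;\, B_1^{n(R'-\delta_1')}) \\
&\le_{(c)} H(\tilde{B}_1^{n(R'-\delta_1')}) + I(Y_1^n;\, B_1^{n(R'-\delta_1')}) \\
&\le_{(d)} H(\tilde{B}_1^{n(R'-\delta_1')}) + I(X_1^n;\, Y_1^n).
\end{aligned}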

The first equality holds because the message bits are fair coin tosses. (a) comes from the data processing inequality
when considering the following trivial Markov chain: $B_1^{n(R'-\delta_1')} - (\tilde{B}_1^{n(R'-\delta_1')}, Y_1^n) - B_1^{n(R'-\delta_1')}$, which comes from
the fact that the channel outputs and the error signals are enough to reconstruct the original bits. After expanding in
terms of entropies, the $H(\tilde{B}_1^{n(R'-\delta_1')} \mid Y_1^n, B_1^{n(R'-\delta_1')})$ term can be dropped to give (b) since this conditional entropy
is zero because the error signal can be reconstructed from the message bits and the channel outputs. (c) comes
from dropping conditioning, while the final inequality (d) comes from applying the data processing inequality to
the Markov chain $B_1^{n(R'-\delta_1')} - X_1^n - Y_1^n$ capturing the lack of feedback in the system.



B. Lemma 4.4

Before proving Lemma 4.4, it is useful to establish a result involving typical sets.

9 At depth r within a false path, each node expansion for a sequential decoder requires O(r) multiply-accumulate operations to evaluate.
This polynomial-order term is insignificant when compared to the rate-dependent exponential increase in the number of false nodes with
increasing search depth. Thus, the polynomial term can be bounded away by just treating it as a slight increase in the rate.

1) Typical set lemma:

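Lemma 2.1: For every finite DMC $G$ and $\epsilon_1, \epsilon_2 > 0$, there exists a constant $K$ such that for every $\vec{x}$

\[ \mathcal{P}_G(\vec{Y} \in J_{\vec{x}}^{\epsilon_1,\epsilon_2} \mid \vec{X} = \vec{x}) \ge 1 - |\mathcal{X}||\mathcal{Y}| \exp(-Kd) \tag{43} \]

where $d$ is the length of the vectors $\vec{x}, \vec{Y}$, and the appropriate typical set is

\[ J_{\vec{x}}^{\epsilon_1,\epsilon_2} = \left\{ \vec{y} \;\middle|\; \forall x \in \mathcal{X} \text{ either } \frac{n_x(x)}{d} < \epsilon_2 \text{ or } \forall y \in \mathcal{Y},\ \frac{n_{x,y}(x,y)}{n_x(x)} \in (g_{y|x} - \epsilon_1,\, g_{y|x} + \epsilon_1) \right\} \tag{44} \]

where $n_{x,y}(x,y)$ is the count of how many times $(x,y)$ occurs in the sequence $(x_1,y_1), (x_2,y_2), \ldots, (x_d,y_d)$ and
$n_x(x)$ is the count of how many $x$ are present in the length-$d$ vector $\vec{x}$.

Furthermore, for any $\vec{y} \in J_{\vec{x}}^{\epsilon_1,\epsilon_2}$, the probability of the sequence $\vec{y}$ under a different channel $P$ satisfies

\[ \frac{\mathcal{P}_P(Y_1^d = \vec{y} \mid X_1^d = \vec{x})}{\mathcal{P}_G(Y_1^d = \vec{y} \mid X_1^d = \vec{x})} \ge \exp\Big(-d\Big[D(G\|P|\vec{r}(\vec{x})) + (2\epsilon_2 + \epsilon_1) \sum_{x,y:\, r_x \ne 0,\, g_{y|x} \ne 0} \Big|\ln\frac{g_{y|x}}{p_{y|x}}\Big|\Big]\Big) \tag{45} \]

where $\vec{r}(\vec{x})$ is the type of $\vec{x}$. In particular,

\[ \frac{\mathcal{P}_P(Y_1^d = \vec{y} \mid X_1^d = \vec{x})}{\mathcal{P}_G(Y_1^d = \vec{y} \mid X_1^d = \vec{x})} \ge \exp\Big(-d\Big[\max_{\vec{r}} D(G\|P|\vec{r}) + (2\epsilon_2 + \epsilon_1) \sum_{x,y:\, n_x(x) \ne 0,\, g_{y|x} \ne 0} \Big|\ln\frac{g_{y|x}}{p_{y|x}}\Big|\Big]\Big). \tag{46} \]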

Proof: The first goal is to establish (43). For every $x \in \mathcal{X}$, the weak law of large numbers for iid finite random
variables says that the relative frequency of $y$'s will concentrate within $\epsilon_1$ of $g_{y|x}$. Simple Chernoff bounds for the
Bernoulli random variables representing the indicator functions tell us that this convergence is exponentially fast,
in that $\forall (x,y)\ \exists\, \zeta_{x,y} > 0$ so that if the channel input is always $x$ for a length $d$, the random number $N_y$ of times
the channel output is $y$ satisfies

\[ \mathcal{P}\Big(\Big|\frac{N_y}{d} - g_{y|x}\Big| > \epsilon_1 \,\Big|\, X_1^d = x_1^d\Big) \le \exp(-\zeta_{x,y} d). \]

Let $K' = \min_{x \in \mathcal{X}, y \in \mathcal{Y}} \zeta_{x,y} > 0$. Set $K = \epsilon_2 K'$ since there are at least $\epsilon_2 d$ occurrences of the relevant $x$ values.
Finally, apply the union bound over all possible pairs to get (43).

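To show (45), first note that those $(x,y)$ pairs for which $g_{y|x} = 0$ can be ignored since these cannot occur in
any sequence with nonzero probability under $G$. Then

\begin{aligned}
\mathcal{P}_G(Y_1^d = \vec{y} \mid X_1^d = \vec{x}) &= \prod_{i=1}^{d} g_{y_i|x_i} \\
&= \prod_{x \in \mathcal{X}} \prod_{y \in \mathcal{Y}} g_{y|x}^{\,n_{x,y}(x,y)} \\
&= \Big(\prod_{x \in \mathcal{X}} \prod_{y \in \mathcal{Y}} g_{y|x}^{\,n_{x,y}(x,y)/d}\Big)^{d} \\
&= \exp\Big(d \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{n_{x,y}(x,y)}{d} \ln g_{y|x}\Big) \\
&= \exp\Big(d \sum_{x \in \mathcal{X}} \frac{n_x(x)}{d} \sum_{y \in \mathcal{Y}} \frac{n_{x,y}(x,y)}{n_x(x)} \ln g_{y|x}\Big).
\end{aligned}

Similarly

\[ \mathcal{P}_P(Y_1^d = \vec{y} \mid X_1^d = \vec{x}) = \exp\Big(d \sum_{x \in \mathcal{X}} \frac{n_x(x)}{d} \sum_{y \in \mathcal{Y}} \frac{n_{x,y}(x,y)}{n_x(x)} \ln p_{y|x}\Big). \]

The ratio of the two probabilities is thus

\[ \frac{\mathcal{P}_P(Y_1^d = \vec{y} \mid X_1^d = \vec{x})}{\mathcal{P}_G(Y_1^d = \vec{y} \mid X_1^d = \vec{x})} = \exp\Big(-d \sum_{x \in \mathcal{X}} \frac{n_x(x)}{d} \sum_{y \in \mathcal{Y}} \frac{n_{x,y}(x,y)}{n_x(x)} \ln\frac{g_{y|x}}{p_{y|x}}\Big). \]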
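
The rewriting of a memoryless-channel likelihood in terms of joint counts can be checked numerically. The sketch below uses two made-up binary channels `G` and `P` and an arbitrary short input/output pair; all numbers and function names are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter

def log_likelihood_direct(x, y, chan):
    """ln P(y | x) as a plain sum over the d channel uses."""
    return sum(math.log(chan[(xi, yi)]) for xi, yi in zip(x, y))

def log_likelihood_counts(x, y, chan):
    """The same quantity written with the counts n_x and n_{x,y}."""
    d = len(x)
    n_x = Counter(x)
    n_xy = Counter(zip(x, y))
    total = 0.0
    for (xi, yi), n in n_xy.items():
        # (n_x / d) * (n_{x,y} / n_x) * ln(channel probability)
        total += (n_x[xi] / d) * (n / n_x[xi]) * math.log(chan[(xi, yi)])
    return d * total

# Hypothetical channels, indexed by (input, output).
G = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}
P = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6}
x = [0, 0, 1, 1, 0, 1, 0, 0]
y = [0, 1, 1, 1, 0, 0, 0, 0]

log_ratio_counts = log_likelihood_counts(x, y, P) - log_likelihood_counts(x, y, G)
log_ratio_direct = log_likelihood_direct(x, y, P) - log_likelihood_direct(x, y, G)
print(log_ratio_counts, log_ratio_direct)
```

Both routes give the same log-ratio, which is the exponent that the remainder of the proof bounds term by term.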
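Now apply the definition of $J_{\vec{x}}^{\epsilon_1,\epsilon_2}$ and first bound the contribution to the exponent by those inputs $x \in \mathcal{X}^{rare}$
that occur too rarely: $n_x(x) < \epsilon_2 d$. We drop the arguments of $(x,y)$ when they are obvious from context, and write $r_x = n_x/d$ for the components of the type.

\begin{aligned}
\sum_{x \in \mathcal{X}^{rare}} r_x \sum_{y \in \mathcal{Y}} \frac{n_{x,y}}{n_x} \ln\frac{g_{y|x}}{p_{y|x}}
&\le \epsilon_2 \sum_{x \in \mathcal{X}^{rare},\, y:\, r_x \ne 0,\, g_{y|x} \ne 0} \Big|\ln\frac{g_{y|x}}{p_{y|x}}\Big| \\
&\le 2\epsilon_2 \sum_{x,y:\, r_x \ne 0,\, g_{y|x} \ne 0} \Big|\ln\frac{g_{y|x}}{p_{y|x}}\Big| + \sum_{x \in \mathcal{X}^{rare}} r_x \sum_{y \in \mathcal{Y}} g_{y|x} \ln\frac{g_{y|x}}{p_{y|x}}.
\end{aligned}

For the non-rare $x$, the $n_{x,y}/n_x$ are already within $\epsilon_1$ of $g_{y|x}$ and thus, for $\vec{y} \in J_{\vec{x}}^{\epsilon_1,\epsilon_2}$,

\begin{aligned}
\frac{\mathcal{P}_P(Y_1^d = y_1^d \mid X_1^d = x_1^d)}{\mathcal{P}_G(Y_1^d = y_1^d \mid X_1^d = x_1^d)}
&= \exp\Big(-d \sum_{x \in \mathcal{X}} r_x \sum_{y \in \mathcal{Y}} \frac{n_{x,y}}{r_x d} \ln\frac{g_{y|x}}{p_{y|x}}\Big) \\
&\ge \exp\Big(-d\Big[\sum_{x \in \mathcal{X}} r_x \sum_{y \in \mathcal{Y}} g_{y|x} \ln\frac{g_{y|x}}{p_{y|x}} + (2\epsilon_2 + \epsilon_1) \sum_{x,y:\, r_x \ne 0,\, g_{y|x} \ne 0} \Big|\ln\frac{g_{y|x}}{p_{y|x}}\Big|\Big]\Big) \\
&= \exp\Big(-d\Big[D(G\|P|\vec{r}) + (2\epsilon_2 + \epsilon_1) \sum_{x,y:\, r_x \ne 0,\, g_{y|x} \ne 0} \Big|\ln\frac{g_{y|x}}{p_{y|x}}\Big|\Big]\Big)
\end{aligned}

which establishes (45). To get (46), just bound by the worst possible $\vec{r}$.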

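2) Proof of Lemma 4.4 itself: If $g_{y|x} \ne 0$, then it is safe to assume $p_{y|x} \ne 0$ as well since otherwise the
divergence is infinite and the lemma is trivially true.

The finite sum $\sum_{x,y:\, r_x \ne 0,\, g_{y|x} \ne 0} |\ln\frac{g_{y|x}}{p_{y|x}}|$ is thus just some finite constant $K'$ that depends only on $G$ and $P$. By
choosing $\epsilon_1, \epsilon_2$ small enough, it is possible to satisfy $(2\epsilon_2 + \epsilon_1) \sum_{x,y:\, r_x \ne 0,\, g_{y|x} \ne 0} |\ln\frac{g_{y|x}}{p_{y|x}}| < \epsilon$.

The event $A$ has a substantial conditional probability $\delta$ when channel $G$ is used and this probability does not
diminish with $d$. Consequently, Lemma 2.1 implies that for the chosen $\epsilon_1, \epsilon_2 > 0$, there exists a constant $K$ so that

\[ \mathcal{P}_G(\vec{Y} \in J_{\vec{x}}^{\epsilon_1,\epsilon_2} \mid \vec{X} = \vec{x}) \ge 1 - |\mathcal{X}||\mathcal{Y}| \exp(-Kd). \]

Pick a $d_0(G, \delta, \epsilon_1, \epsilon_2) > 0$ large enough so that $|\mathcal{X}||\mathcal{Y}| \exp(-K d_0) < \frac{\delta}{2}$. Thus $\mathcal{P}_G(A \cap J_{\vec{x}}^{\epsilon_1,\epsilon_2} \mid \vec{X} = \vec{x}) \ge \frac{\delta}{2}$. The
immediate application of the second part of Lemma 2.1 gives

\begin{aligned}
\mathcal{P}_P(A \mid \vec{X} = \vec{x}) &\ge \mathcal{P}_P(A \cap J_{\vec{x}}^{\epsilon_1,\epsilon_2} \mid \vec{X} = \vec{x}) \\
&\ge \frac{\delta}{2} \exp\Big(-d\Big[D(G\|P|\vec{r}(\vec{x})) + (2\epsilon_2 + \epsilon_1) \sum_{x,y:\, r_x \ne 0,\, g_{y|x} \ne 0} \Big|\ln\frac{g_{y|x}}{p_{y|x}}\Big|\Big]\Big) \\
&\ge \frac{\delta}{2} \exp(-d[D(G\|P|\vec{r}(\vec{x})) + \epsilon])
\end{aligned}

which is the desired result.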



C. Expressing the symmetric uncertainty-focusing bound in parametric form 

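\begin{aligned}
E_a(R) &= \inf_{0 \le \lambda < 1} \frac{E_{sp}(\lambda R)}{1 - \lambda} \\
&= \inf_{0 \le \lambda < 1} \max_{\rho \ge 0} \frac{E_0(\rho) - \rho \lambda R}{1 - \lambda}.
\end{aligned}

To find the minimizing $\lambda$, first observe that given $\lambda$, the maximizing $\rho$ is the solution to

\[ \frac{\partial E_0(\rho)}{\partial \rho} = \lambda R. \tag{47} \]

A solution exists because $E_0$ is concave ∩ [8]. If the solution is not unique, just pick the smallest solution. Call
this solution to (47) $\rho(\lambda, R)$. Let

\[ g(\lambda, R) = E_0(\rho(\lambda, R)) - \rho(\lambda, R) \lambda R. \]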
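Now, the goal is to minimize $\frac{g(\lambda, R)}{1 - \lambda}$ with respect to $\lambda$. Take a derivative and set it to zero:

\[ g(\lambda, R) + (1 - \lambda) \frac{\partial g(\lambda, R)}{\partial \lambda} = 0. \]

But

\begin{aligned}
\frac{\partial g(\lambda, R)}{\partial \lambda} &= \frac{\partial E_0(\rho(\lambda, R))}{\partial \rho} \frac{\partial \rho(\lambda, R)}{\partial \lambda} - \lambda R \frac{\partial \rho(\lambda, R)}{\partial \lambda} - \rho(\lambda, R) R \\
&= \frac{\partial \rho(\lambda, R)}{\partial \lambda} \Big(\frac{\partial E_0(\rho(\lambda, R))}{\partial \rho} - \lambda R\Big) - \rho(\lambda, R) R \\
&= -\rho(\lambda, R) R.
\end{aligned}

So, solve for $\lambda^*$ in

\[ g(\lambda^*, R) = \rho(\lambda^*, R) R (1 - \lambda^*). \]

Plugging in the definition of $g$ gives

\[ \frac{E_0(\rho(\lambda^*, R))}{\rho(\lambda^*, R) R} - \lambda^* = 1 - \lambda^*, \]

which implies

\[ \frac{E_0(\rho(\lambda^*, R))}{\rho(\lambda^*, R) R} = 1 \]

or $R = \frac{E_0(\rho(\lambda^*, R))}{\rho(\lambda^*, R)}$. For the other part, just notice

\begin{aligned}
\frac{g(\lambda^*, R)}{1 - \lambda^*} &= \frac{\rho(\lambda^*, R) R (1 - \lambda^*)}{1 - \lambda^*} \\
&= \rho(\lambda^*, R) R \\
&= E_0(\rho(\lambda^*, R)).
\end{aligned}

Setting $\eta = \rho(\lambda^*, R)$ gives the desired parametric form. □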
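
As a numerical sanity check of this appendix, the sketch below compares a brute-force evaluation of the infimum over $\lambda$ (with an inner maximization over $\rho$) against the parametric value $E_0(\eta)$ at rate $R = E_0(\eta)/\eta$. It assumes a BSC with crossover probability 0.02 and uniform inputs, for which Gallager's function has a closed form; the grid sizes and function names are arbitrary choices made for this illustration.

```python
import math

def e0_bsc(rho, p):
    """Gallager's E_0(rho) in nats for a BSC with uniform inputs."""
    s = 1.0 / (1.0 + rho)
    return rho * math.log(2) - (1.0 + rho) * math.log(p ** s + (1.0 - p) ** s)

def focusing_bound_direct(rate, p, rho_max=5.0, rho_steps=500, lam_steps=1000):
    """Brute-force inf over lambda of max over rho, on finite grids."""
    rhos = [rho_max * j / rho_steps for j in range(rho_steps + 1)]
    e0s = [e0_bsc(r, p) for r in rhos]
    best = math.inf
    for i in range(lam_steps):          # lambda in [0, 1)
        lam = i / lam_steps
        e_sp = max(e - r * lam * rate for e, r in zip(e0s, rhos))
        best = min(best, e_sp / (1.0 - lam))
    return best

p = 0.02
eta = 1.0
rate = e0_bsc(eta, p) / eta      # parametric rate R = E_0(eta) / eta
parametric = e0_bsc(eta, p)      # parametric exponent E_0(eta)
direct = focusing_bound_direct(rate, p)
print(f"rate = {rate:.4f} nats")
print(f"direct = {direct:.4f}, parametric = {parametric:.4f}")
```

The two values agree up to grid resolution. Note that for every $\lambda$, choosing $\rho = \eta$ already gives exactly $E_0(\eta)$ at this rate, so the brute-force value can never fall below the parametric one.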



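D. Proof of the low-rate approximation in Theorem 3.3

First, solve for $2^\eta$ in (28) in terms of the reliability $E_a^{bec}(R') = \alpha$. This gives $2^\eta = \frac{1 - \beta}{2^{-\alpha} - \beta}$ and so
$\eta = \alpha + \log_2(1 - \beta) - \log_2(1 - 2^\alpha \beta)$. Plugging into the $R'$ expression (28) gives the desired $C'(\alpha)$ tradeoff.

It is worthwhile to investigate the behavior of this $C'(\alpha)$ for values of reliability $\alpha$ close (within a factor of 2)
to the fundamental upper limit of $-\log_2 \beta$. Consider $0 < \epsilon < -\frac{1}{2}\log_2 \beta$. When $\alpha = (-\log_2 \beta - \epsilon)$,

\begin{aligned}
C'(-\log_2 \beta - \epsilon) &= \frac{-\log_2 \beta - \epsilon}{-\log_2 \beta - \epsilon + \log_2(1 - \beta) - \log_2(1 - \beta 2^{-\log_2 \beta - \epsilon})} \\
&= \Big(1 + \frac{\log_2(1 - \beta) - \log_2(1 - 2^{-\epsilon})}{-\log_2 \beta - \epsilon}\Big)^{-1}.
\end{aligned}

$(1 - 2^{-\epsilon})$ is a concave ∩ function of $\epsilon \in [0, 1]$ and can thus be lower-bounded by $\frac{1}{2}\epsilon$. This gives

\begin{aligned}
C'(-\log_2 \beta - \epsilon) &\ge \Big(1 + \frac{\log_2(1 - \beta) - \log_2(\epsilon/2)}{-\log_2 \beta - \epsilon}\Big)^{-1} \\
&\ge \Big(1 + 2\,\frac{\log_2(2\epsilon^{-1})}{\log_2(\beta^{-1})}\Big)^{-1}.
\end{aligned}

Plugging in $\epsilon = 2\beta^r$ is valid as long as $r > \frac{2 - \log_2 \log_2 \beta^{-1}}{\log_2 \beta^{-1}}$. This gives

\[ C'((-\log_2 \beta) - 2\beta^r) \ge \frac{1}{1 + 2r}. \tag{48} \]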
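
A quick numerical check of the low-rate approximation is sketched below. It assumes that the binary erasure channel tradeoff has the closed form $C'(\alpha) = \alpha / \eta$ with $\eta = \alpha + \log_2(1-\beta) - \log_2(1 - 2^\alpha \beta)$, and that the resulting bound is $1/(1+2r)$ at reliability $-\log_2\beta - 2\beta^r$; both are assumptions of this illustration rather than quotations of (28) and (48). The parameter pairs are arbitrary values satisfying $2\beta^r \le 1$ and $2\beta^r < -\frac{1}{2}\log_2\beta$.

```python
import math

def rate_for_reliability(alpha, beta):
    """Assumed BEC tradeoff C'(alpha) in bits per channel use."""
    eta = alpha + math.log2(1 - beta) - math.log2(1 - beta * 2 ** alpha)
    return alpha / eta

def check_low_rate_bound(beta, r):
    """Return (C' at reliability -log2(beta) - 2 beta^r, assumed bound 1/(1+2r))."""
    eps = 2 * beta ** r
    alpha = -math.log2(beta) - eps
    return rate_for_reliability(alpha, beta), 1.0 / (1.0 + 2.0 * r)

results = [check_low_rate_bound(beta, r)
           for beta, r in [(0.25, 1), (0.25, 2), (0.05, 1), (0.05, 3)]]
for achieved, bound in results:
    print(f"C' = {achieved:.4f} >= {bound:.4f}")
```

In each case the rate supported at a reliability this close to $-\log_2\beta$ stays above the simple bound, illustrating how slowly the rate must shrink as the reliability approaches its upper limit.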
E. Proof of Lemma 6.1

An error can occur only when there have not been enough successful transmissions to get the i-th bit out in 
time. Applying the union bound to such events gives 

k=1 \*=r&i 



l^t=\4r\ * i — k 



< 



This establishes the desired result. 



F. The details in the proof of Theorem 3.3



Notice that the event { r ; r k 1 < r \ 1 r k * } is just the error event for an ideal erasure-channel block 
code with block length n(/c) = d + [-^7] — [-^7] and a bit rate of R'(k) = ^F^TTTX] • This * s because it represents 
the event that the channel erases too many symbols. Let X(k) = 1 — , r 5 i_ r fc -, . Then ra(fc) = \{k)n(k) + d and 
G (X(k)R',X(k)R' + ^ray). Thus, for every ei > 0, there exists a di(ei) so that for all d > di(ei), 



< exp(-n(fc) [D(X(k)R' + ^- \ | 1 - /?) - ) 



exp(— <i 



L>(A(fc)i?' + ^||l-/3)- ei 



l-A(fc) 



)• 



Now, divide the events in d29l ) into two categories (illustrated in Figure IT3T > based on a critical value for A(fc) 

-' \R'\\l-f3) (j . - _ , £>(A'fl'||l-/3) 
1-A • J>ei n — «( 1 _>.)£,( fl /|| 1 _ J 3)- 



and fc. Let A* from (1271 ) be the A that minimizes the exponent D ^ A { ? JI 1 ^ . Set n = ^ /-i^x^nrff'in^^v ^ et ^ ^ e 



the largest k for which n(&) > n. For all k < k, 



p( d+r£i*-r*i -d+r*i-r*V 

< exp^n^^A^' + - /3 J - ei)) 



< exp(-ri(A;)(.D^i2' + |||l-/^ 



exp(-(n + (n(fc) - n)){D{R' + -||1 - /?) - ei)) 

n 

exp / fl (v g ||i- ffl C (iy + g||i-ffl-«, A cxp / _ + 2 _ _ <49) 



1-A* v D(R>\\l-(3) I \ n 

Meanwhile, for k >k, 

T!£^Z t i _ k D(X(k)R' + -fo||i-fl- ei 

^( ^ J tr~ < ■■ i— ) < exp(-d Vtt^ ) 

*+r&i-r&i "d+r^i-r^r i-*(*o 

^D(A(fc)# + |||l-/3) ei 
< exp(-d[ _]). (50) 
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If there were no § term above, then the terms (l50t could be bounded by using A* in place of X(k) since A* is the 
worst possible A. But since the divergence is continuous in its first argument and | is small, we bound them all 
by allowing for a small slop Explicitly, for every e 2 > 0, it is clear there exists a (^(£2) > so that for all 
d > ^2(^2) and k such that n(k) < n, we have 

m/ l^t=\±] Zt i-k . , „. ,D(X*R' 1-/3) ei 

Putting the two bounds (49) and (51) together for $d > \max(d_1(\epsilon_1), d_2(\epsilon_2))$ gives
i 



V(B i (\-]+d)^B i ) 




< ^ex P (- ^^^ ^ + : g - » )) exp(-(n( fc ) - W + *||1 - fl - ,))] 

fc=i 

+(, _ fc + 1) exp(-d[(l - e 2 ) - r -^]) 



< exp( a { D{R'\\l-/3) Jj 



r-57l E eM-l(D(R' + 2fT 1 ||l - p) - ex)) 



ft 



fl(A*ft'||l-/?) / , £>(A*ft'||l-/?) Cl 



(l-A*)D(ft||l-/3) y ^ v V 1-A< 

G. Proof of Lemma 7.1

That the transmission times $T_j$ are iid is obvious since each depends on disjoint channel uses and the channel is
memoryless and stationary. Before proving (33), it is useful to first establish

\[ \mathcal{P}(T_j > tck) \le \exp(\rho R nck)\,[\exp(-ck E_0(\rho, \vec{q}))]^t. \tag{52} \]

The only way that the transmission time can be longer than $tck$ for some integer $t \ge 1$ is if the block-length-$tck$
code cannot be correctly decoded to within a list of size $l$. The effective rate of the block code in nats is thus

\[ R \frac{nck}{tck} = R \frac{n}{t}. \]

Applying the list-decoding upper bound (12) on the probability of error for random block coding gives

\begin{aligned}
\mathcal{P}(T_j > tck) &\le \exp\Big(-tck\Big[E_0(\rho, \vec{q}) - \rho R \frac{n}{t}\Big]\Big) \\
&= \exp(\rho R nck) \exp(-tck E_0(\rho, \vec{q})) \\
&= \exp(\rho R nck)\,[\exp(-ck E_0(\rho, \vec{q}))]^t
\end{aligned}
where this holds for all < p < 2 l . Pulling the constant into the exponent gives 

V{Tj>tck) < [exp(-ckE (p, q))f~ ^o(Ts) 
= [exp(— ckEo(p, q))\ E o(p^i 
< [exp(-cfcE (p,$)f~ r " s °^ 
= [eM-ckE (p,q))] t ~ lIipM] 
and this proves the desired result. 